Summary

The collective behavior of systems of many interacting components is
traditionally a central topic of statistical physics. In recent years, economic
systems such as stock markets have increasingly been studied as well, with their
"components" (people and firms) modeled by agents with simplified decision rules
(for example, neural networks). Of particular interest are the time series
generated by such systems, and the ability of the agents to extract information
from these time series.

Chapter 2 of this work is devoted to the concept of antipredictable time series:
for every prediction algorithm there is a time series for which it fails
completely. The properties of these time series are studied for three specific
algorithms and compared to those of well-predictable time series. Aspects of
interest include the length of cycles for discrete time series, chaotic behavior
for continuous time series, and the suppression, in antipredictable time series,
of correlations to which the algorithm is sensitive.

The first algorithm studied is the simple perceptron, for which the dynamics of
the weight vector allows a number of statements about the autocorrelations of
the generated time series. A possible practical application to the generation of
binary time series with tailor-made autocorrelation functions is explained.

The second algorithm, a continuous perceptron, turns out to be a complicated
nonlinear map with high-dimensional chaos and intermittent behavior. After a
mean-field calculation that determines the statistical properties of the
generated time series, the behavior is studied as a function of system size and
gain parameter, and Lyapunov exponents and the attractor dimension are
determined.

The third prediction algorithm uses Boolean functions. Some properties of the
generated time series, e.g. the length and number of cycles, can be proven using
graph-theoretical approaches. The chances of success of similar prediction
algorithms with longer or shorter memory are examined.

Chapter 3 describes different variants of the Minority Game, in which a large
number of players must independently choose one of two options, and those in
the minority win. Particular attention is given to the variants introduced by
our research group: a correct and more complete treatment of neural networks in
the Minority Game is given; the stochastic strategy of Reents and Metzler is
presented and solved analytically; and it is shown that a generalization to more
than two options is possible for all common strategies and yields qualitatively
comparable results.

A connection to Chapter 2 is established by the fact that, for many variants in
certain parameter ranges, the time series of decisions can be approximated by
the antipredictable time series of a single algorithm that represents the
majority of the players.

Chapter 4 examines another connection between neural networks and game theory:
here a neural network tries to develop a good strategy in a two-player zero-sum
game by repeated playing and adaptation of its weights. A variation of Hebbian
learning is studied which can approximately find the Nash equilibrium strategy
of the game. The properties of this rule can be given analytically for small
payoff matrices; for large matrices with random entries, estimates can be made.

For learning patterns with a preferred direction (such as time series patterns),
however, this algorithm must be modified. The variant that then suggests itself
(which can likewise be derived from Hebbian learning) turns out to be a
variation of a well-known learning algorithm for matrix games.

Chapter 1
Introduction



"Neural Networks, Game Theory and Time Series Generation" - the title of this 
dissertation contains three fields of research that are so broad that bookshelves 
full of literature have been written on each one of them. Consequently, I will 
start with a disclaimer: this dissertation is not a textbook, and it cannot give 
exhaustive accounts of each field. Instead, it is my aim to present several projects 
which connect the three fields. Although more elaborate introductions to the 
ideas and formalisms needed to understand the projects will be given in the 
individual chapters, I will start with a brief overview of the fields and an
outline of the dissertation.

1.1 Neural Networks 

In an attempt to understand the workings of animal and human nerve systems, 
simple mathematical models of nerve cells (neurons) have been devised, which 
can be combined into artificial neural networks of considerable complexity. The 
simplest building block of these networks, the McCulloch-Pitts neuron [1], is
interesting enough when studied in isolation. This single-neuron network, which
is then called a simple perceptron, can perform limited, but nontrivial, feats
of storage, classification, and learning of unknown rules [2, 3, 4]. It
formalizes two features of biological neurons: they receive signals from other
cells via synaptic connections, and they are active if the summed input exceeds
some threshold, and inactive if it does not.

For increasing complexity and realism, compared to the simple perceptron, 
two directions suggest themselves: one is to assemble more complex structures 
out of simple perceptrons, such as multi-layer feed-forward networks, associa- 
tive memory networks or recurrent networks [3]. This direction tends towards 
computer science and machine learning. The other direction is to incorporate 
more details from biological mechanisms [5, 6], such as explicit modeling of cell 
membrane potential and firing rates. While these models are hard to approach
analytically, they can be used to emulate biological pattern recognition
problems realistically [7, 8]. In this dissertation, however, the simplest architectures
that are suitable to a given task - namely, simple and continuous perceptrons, 
and multi-class perceptrons - will be used. Understanding their interaction with 
themselves or other networks is a sufficient challenge. 

1.2 Time series generation 

One of the applications that neural networks have been put to is the prediction of 
time series [9, 10]. The idea here is that many physical systems generate a stream 
of observables at regular time intervals (such as daily temperature measurements)
that exhibit certain regularities. These properties of the time series can be used 
by a suitable algorithm to predict future values from a finite number of past 
values. Time series that can be predicted without error by some algorithm are 
those that can be generated by the same algorithm by feeding the predicted value 
back into the history that is used to make the next prediction. Therefore, it is 
helpful to look at the properties of a time series generated by an algorithm to 
learn about its prediction capacities. 

It has been pointed out in Ref. [11] that the success of a prediction algorithm 
depends completely on the properties of the time series that it is applied to, and 
that there are always time series for which a given algorithm fails completely. 
These sequences can again be generated by that algorithm by inverting its output 
and feeding it back to the history. Comparing these sequences to the ones that 
are well predictable for the same algorithm sheds more light on the capabilities 
of the algorithm. 

1.3 Game theory 

Game theory deals with situations in which two or more players must make
decisions, and the payoff which each player receives depends on the combined
decisions of all players. In such situations, being able to predict what the
others are going to do is extremely useful, whereas making decisions that are
predictable for others is not.

Two game-theoretical problems will be treated in detail in this work. In
two-player zero-sum games, both players pick one of their available options
independently at the same time, and the gain of one player is always the loss
of the other. As the groundbreaking studies of von Neumann showed [12], this
type of game has a unique equilibrium set of strategies for both players, which
can be determined by rational calculations and beyond which no improvements for
either player are possible. This makes it possible to quantify the success of a
given learning algorithm that develops a strategy from repeated playing, by
comparing the achieved strategy to the equilibrium strategy.

In the second scenario that will be presented, the Minority Game [13], a 
large number of players pick one of two options, and those in the majority lose. 
Although randomly choosing one option would be the rational strategy, other 
prescriptions (collectively labeled "bounded rationality") where players try to 
predict the majority decision can lead to coordination among players, and thus 
improved average gains, compared to random guessing. On the other hand, 
attempts to avoid the predicted majority can lead to over-reactions, which result 
in herding behavior and dramatically reduced average gains. 

1.4 An outline of this work 

As mentioned, to understand the properties of prediction algorithms, it is helpful
to study the properties of sequences that can be predicted well by them and
especially those that are predicted wrongly every time. In Chapter 2, I will do
this for three prediction algorithms: simple and continuous perceptrons, and
Boolean functions. Aspects of interest are properties of cycles (for discrete
time series) and attractors (for continuous ones), and especially the suppression,
in antipredictable time series, of correlations that the prediction algorithm is
sensitive to.

Chapter 3 will present different aspects of the Minority Game, focusing on the 
variations of the game that were introduced by our research group, such as neural 
networks (Sec. 3.4 - this is a first link between neural networks and game theory) 
and the stochastic strategy presented in Sec. 3.6 (where, if a memory is included, 
the players can behave like a Boolean function with a learning algorithm). It will
be shown in Sec. 3.7 that all presented strategies can be generalized to more than
two options.

I will point out, where applicable, under what circumstances the dynamics of 
the Minority Game generate a time series that is partly antipredictable for the 
individual players, and completely antipredictable for a predictor that represents 
the ensemble of players, thus combining game theory and time series generation. 

Chapter 4 presents a different aspect of connecting game theory and neural 
networks: there, a simple neural network tries to learn a two-player zero-sum 
game by adapting its weights and learning from experience. If random patterns 
are presented to the network, a modified Hebbian learning rule can lead to a 
good strategy and, under some circumstances, even converge to the equilibrium 
strategy. 

If biased patterns are used, the learning rule has to be modified. The modified 
rule turns out to be a stochastic variation of a well-known learning algorithm for 
matrix games.

Chapter 2

Antipredictable time series 



2.1 Prediction of time series 

A time series is a sequence of values $x^{t_1}, x^{t_2}, \dots$ with
$t_1 < t_2 < \dots$. Often, the values are taken at regular time intervals, and
the times $t_1, t_2, \dots$ can be denoted by integer numbers $1, 2, \dots$. In
most cases that are of interest to physicists, time series are observables of a
physical system or variables of a mathematical model, and one tries to
characterize the system by studying the statistical properties of the time
series. However, in many cases it is of practical interest to predict the time
series, i.e., make good guesses about the future values $x^{t+1}, x^{t+2}, \dots$
from knowledge about a finite number of past values $x^{t}, x^{t-1}, \dots$
[9, 14]. For example, considerable effort goes into the prediction of the weather
(temperature, rainfall etc.), and many amateur and professional stock brokers
would love to be able to predict the future of stock prices.

In the absence of information on external factors which might influence the 
time series (like new economic data that allow guesses about stock prices, or 
ecological events that influence the climate), a prediction algorithm can only 
take a finite stretch of the time series as input data and produce a guess based 
on this data. In other words, a prediction algorithm is a function
$g(\mathbf{x}, \mathbf{w})$ that maps an input vector $\mathbf{x}$ (whose
components are the $M$ past values of the time series, which may be binary or
discrete) onto a (continuous or discrete) output.^1 The function may depend on
internal parameters $\mathbf{w}$, such as the weight vectors in a neural network.

In the following sections, I will concentrate mainly on discrete time series
$x^t \in \{-1, 1\}$, switching to the binary notation $\{0, 1\}$ where it is more
convenient. For discrete time series, the concept of accuracy is easy to define:
the error rate is simply the percentage of predictions that do not agree with the
time series. In some cases, continuous or multi-valued discrete values will be
considered as well.

^1 In game theory, "prediction" can mean predicting the probability distribution
of possible outputs instead of giving a single most probable value [15]. I will
not use this definition, which is difficult to apply to systems where only one of
the possible realizations of randomness is observed.

Figure 2.1: A simple representation of a prediction algorithm: external inputs
$\mathbf{x}$ are fed into a function $g(\mathbf{x}, \mathbf{w})$. Internal
parameters $\mathbf{w}$ are modified after comparing prediction and reality
$x^{t+1}$.

Adaptive prediction algorithms are capable of changing their internal param- 
eters based on a comparison between the time series and their prediction of it. 
One reasonable demand is that a prediction algorithm does not adapt its
parameters as long as it is 100% accurate, and adapts if there are discrepancies
between prediction and reality - this is for example realized in gradient-descent- 
based learning algorithms, where the norm of the update that is added to the 
parameters is proportional to the error. 

Another reasonable assumption is that the predictor is deterministic: even 
though the model to be predicted may contain noise, adding noise to the predictor 
is unlikely to make a prediction better (leaving aside concepts like stochastic 
resonance [16]). 

The time series that is to be predicted is generated by some system that can 
be arbitrarily complex, and can include any amount of randomness. The values 
of the time series (which represent observables like temperature, stock values 
etc.) reveal some information about the state of the system. For example, in 
deterministic chaotic systems with a finite number m of degrees of freedom, the 
full information about the state of the system is contained in a time series of at
most 2m values of a suitable observable [17]. 

The prediction algorithm is then a more or less crude model of the system 
that generates the time series, as sketched in Fig. 2.1. Adjusting the parameters 
can improve the fit between the model and reality, or, to speak in the language 
of neural network theory, the overlap between "student" and the "rule" [18] or
"teacher" [19].

Figure 2.2: The next step: the system that generates the time series is included
in the consideration. The predictor models some (hopefully relevant) properties
of the system, and tries to improve the fit.

One could assume that, if the prediction algorithm has structural similarity
to the generator of the time series, and if the time series is largely deterministic, 
the algorithm will predict the time series with fair accuracy after some training 
time. However, it was pointed out in Ref. [11] that for any algorithm that makes 
a binary prediction, there exists a generator of the same degree of complexity 
that generates a time series which will make the predictor fail every single time. 
This generator can be constructed easily: it is an exact copy of the predictor, 
including the internal parameters, the adaptation algorithm and the random
number generator, that takes the same input data and inverts the output. As the
example of the antipersistent binary time series will show later (see Sec. 2.4.3),
in some cases this scheme will also fool predictors that start with a different set 
of initial parameters, if they "forget" their initial state quickly enough. 

One interesting thing about the antipredictable sequence generated in this 
fashion is that it is completely deterministic. In the same fashion, deterministic 
sequences can be designed that make the predictor fail every second, third, etc., 
prediction [11]. 

In the framework outlined above, time series that can be predicted without
error by an algorithm are those where the output of the algorithm is exactly the
next value of the time series. The time series can thus be constructed (without any
external information) by feeding the newly generated bit back into the pattern.
Analogously, antipredictable sequences are those generated by the algorithm,
inverting the bit and feeding it back. In the following, prediction and generation
are thus very much interchangeable, and in order to study the properties of time
series that are predictable or antipredictable, I will actually study time series
generated by the algorithm.
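
The feedback construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `generate`, `error_rate` and `majority_vote` are hypothetical, and the toy predictor is non-adaptive (an adaptive predictor would additionally need an exact copy of its internal parameters inside the generator).

```python
from typing import Callable, List, Sequence

# A (non-adaptive) predictor maps the window of past values to a guess in {-1, +1}.
Predictor = Callable[[Sequence[int]], int]


def generate(predictor: Predictor, history: Sequence[int], steps: int,
             invert: bool = False) -> List[int]:
    """Feed the predictor's output (optionally inverted) back into its own input."""
    window = list(history)               # window[0] is the most recent value
    series = []
    for _ in range(steps):
        guess = predictor(window)
        value = -guess if invert else guess
        series.append(value)
        window = [value] + window[:-1]   # shift the new value into the window
    return series


def error_rate(predictor: Predictor, history: Sequence[int],
               series: Sequence[int]) -> float:
    """Fraction of values in `series` that the predictor guesses wrongly."""
    window = list(history)
    errors = 0
    for value in series:
        if predictor(window) != value:
            errors += 1
        window = [value] + window[:-1]
    return errors / len(series)


def majority_vote(window: Sequence[int]) -> int:
    """Hypothetical toy predictor: guess the majority of the recent values."""
    return 1 if sum(window) >= 0 else -1


start = [1, -1, 1]
predictable = generate(majority_vote, start, 20)
antipredictable = generate(majority_vote, start, 20, invert=True)
print(error_rate(majority_vote, start, predictable))       # 0.0
print(error_rate(majority_vote, start, antipredictable))   # 1.0
```

Both sequences are fully deterministic; the only difference between them is the sign flip applied before the value is fed back.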

The concept of antipredictability may seem highly artificial. However, it can 
turn out to be relevant if the predictors are part of the system they are trying 
to predict (see Fig. 2.3). For example, the concept of the Minority Game that is 
explained in Chapter 3 has a strong element of self-defeating prophecy. 

The following sections will compare the properties of predictable and
antipredictable sequences for two simple algorithms: a simple neural network, and
a Boolean function. Sections 2.2 and 2.3 are based on Refs. [20] and [21],
whereas Sec. 2.4 was presented in Ref. [22]. The differences between the two
kinds of sequences will shed some light on the considered algorithms themselves,
and on the features in the time series that they are sensitive to, giving a
better intuition about the applicability of an algorithm to a given problem.

Figure 2.3: Sometimes, the predicting algorithm is part of the system that it is
trying to predict, together with other algorithms and a complex environment. 



2.2 Simple Perceptrons: the Confused Bit Generator

2.2.1 Sequence generation by a static perceptron: the Bit Generator

The first example of a prediction algorithm that I am going to present is the
simplest model of a neural network, the simple perceptron [2, 3]. Inspired by
biological neurons [1], this model has received much attention, and in spite of its
simplicity, it shows many interesting features (see Ref. [18] for an overview). 

In analogy to a biological neuron that receives impulses from other neurons 
through synaptic links and either starts firing impulses itself or stays passive 
depending on the received inputs, the perceptron performs a weighted sum over 
a set of inputs $x_i$, $i = 1, \dots, M$. Each input is multiplied by a synaptic
weight $w_i$. The output is either $+1$ or $-1$, depending on whether the weighted
sum of inputs is larger or smaller than 0.

In a geometrical interpretation, both the weights and the input data represent 
M-dimensional vectors w and x, and the output is the sign of the scalar product 
between the vectors. 

One way to generate a time series using a perceptron (or, for that matter, any
other feed-forward neural network [23]) is to use an $M$-bit window of the time
series as input and continue the time series with the output on that pattern:

$$x^{t+1} = \mathrm{sign}\left(\sum_{i=1}^{M} w_i\, x^{t-i+1}\right). \tag{2.1}$$

For the simple perceptron with binary output (and, correspondingly, a binary
time series) and fixed weights, this system, which was named Bit Generator (BG),
was studied in Refs. [24, 25, 26]. It is a deterministic dynamical system with a
finite state space: there are only $2^M$ possible histories. Once a history
appears a second time, the system is on a cycle. Simulations show that cycles
have a typical length $l < 2M$; their average length $\langle l \rangle$ (averaged
over initial conditions) increases polynomially in $M$ [25]. Typically, cycles
are dominated by one short sequence that is interrupted by "defects". The
wavelength of the underlying short sequence usually corresponds to a wavelength
that is strongly represented in a Fourier decomposition of the weight vector.

As long as the weight stays fixed, the predictable sequence for a given
$\mathbf{w}$ is precisely the antipredictable sequence for $-\mathbf{w}$. Things
get more interesting if the weight vector becomes adaptive.
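
Because the Bit Generator has at most $2^M$ distinct histories, its cycle can be found by recording when each history was first seen. The following is a minimal sketch, not the procedure used in the cited references; the function name `bit_generator_cycle`, the Gaussian weights and the tie-breaking convention `sign(0) = +1` are assumptions made for illustration.

```python
import random


def sign(h: float) -> int:
    # Tie-breaking at h == 0 is an assumption; it does not occur for generic weights.
    return 1 if h >= 0 else -1


def bit_generator_cycle(weights, history):
    """Iterate the static perceptron until a history repeats.

    `history[0]` is the most recent bit. Returns (transient_length, cycle_length).
    """
    window = tuple(history)
    first_seen = {}
    t = 0
    while window not in first_seen:
        first_seen[window] = t
        h = sum(w * x for w, x in zip(weights, window))
        window = (sign(h),) + window[:-1]   # new bit enters, oldest bit drops out
        t += 1
    return first_seen[window], t - first_seen[window]


rng = random.Random(0)
M = 12
weights = [rng.gauss(0.0, 1.0) for _ in range(M)]
history = [rng.choice((-1, 1)) for _ in range(M)]
transient, cycle = bit_generator_cycle(weights, history)
print("transient:", transient, "cycle length:", cycle)
```

For a hand-checkable case, the weights `[0.0, -1.0]` give $x^{t+1} = -x^{t-1}$, which produces the period-4 sequence `+ + - -` with no transient.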

2.2.2 Hebbian Learning: the Confused Bit Generator 

In order to find regularities in a set of data, the weights of a neural network 
have to adapt to the data. This can occur either offline (all patterns and desired 
outputs are known and are processed at the same time) or online (patterns are
shown to the network one by one and then discarded). Since the patterns and 
the output to be learned are generated by the predictor, offline learning makes 
little sense. Therefore, online learning will exclusively be considered from now 
on. 

Online learning algorithms for perceptrons come in many variations, ranging
from simple to elaborate [3, 27, 28, 29, 4]. However, most boil down to the
same principle: the updated weight vector $\mathbf{w}^{t+1}$ is a linear
combination of the previous weight vector $\mathbf{w}^{t}$ and the pattern
$\mathbf{x}^{t}$. This seems natural for symmetry reasons, since these two
vectors are the only preferred directions in a possibly high-dimensional space.^2

The coefficients of the linear combination can depend on the desired output
(in our case, the next step of the time series, $x^{t+1}$), the so-called hidden
field $h^t = \mathbf{w}^t \cdot \mathbf{x}^t$, a learning rate $\eta$, and a few
other quantities, which should be accessible to the neural network. The two most
basic learning algorithms are the Hebbian rule

$$\mathbf{w}^{t+1} = \mathbf{w}^{t} + \frac{\eta}{M}\,\mathbf{x}^{t} x^{t+1} \tag{2.2}$$

and the Rosenblatt rule,

$$\mathbf{w}^{t+1} = \mathbf{w}^{t} + \frac{\eta}{M}\,\mathbf{x}^{t} x^{t+1}\,\Theta\!\left(-x^{t+1}\,\mathrm{sign}(h^{t})\right). \tag{2.3}$$

In the latter case, the correction step is applied only if the output of the
perceptron and the desired output disagree. Since this is always the case for a
completely antipredictable time series, both rules amount to the same in this
case. The perceptron which always learns the opposite of its own output was
studied first in Ref. [11], and examined extensively in Refs. [21] and [20],
where it was named Confused Bit Generator (CBG). A very similar system was
studied in Ref. [30]. The CBG is defined by the equations

$$x^{t+1} = -\mathrm{sign}(\mathbf{x}^{t} \cdot \mathbf{w}^{t}); \tag{2.4}$$

$$\mathbf{w}^{t+1} = \mathbf{w}^{t} + (\eta/M)\, x^{t+1}\, \mathbf{x}^{t}. \tag{2.5}$$

This is a deterministic dynamical system with $2M$ variables, instead of $M$
variables in the BG. Indeed, the dynamics of the weight vector are very helpful
in understanding the properties of the generated time series.

^2 The interpretation of the pattern as a time series, however, breaks the
symmetry between input dimensions: it may be reasonable to give more weight to
the more recent past. Secs. 2.2.4 and 2.2.5 provide examples for this.
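
The CBG update, a perceptron that outputs the opposite of its own prediction and then learns that output with a Hebbian step, can be sketched as follows. The function names, the choice $\eta = 1$ and the convention `sign(0) = +1` are illustrative assumptions. The second function runs an identical Hebbian perceptron as a predictor on the generated bits and counts its errors, which by construction equals the number of bits.

```python
import random


def sign(h: float) -> int:
    return 1 if h >= 0 else -1   # tie-breaking at h == 0 is an assumption


def confused_bit_generator(weights, history, steps, eta=1.0):
    """Generate bits with x^{t+1} = -sign(x.w), then w <- w + (eta/M) x^{t+1} x."""
    w = list(weights)
    x = list(history)            # x[0] is the most recent bit
    M = len(w)
    bits = []
    for _ in range(steps):
        new_bit = -sign(sum(wi * xi for wi, xi in zip(w, x)))
        w = [wi + (eta / M) * new_bit * xi for wi, xi in zip(w, x)]
        x = [new_bit] + x[:-1]
        bits.append(new_bit)
    return bits, w


def hebbian_prediction_errors(weights, history, series, eta=1.0):
    """Run a Hebbian perceptron as a predictor on `series`; count wrong guesses."""
    w = list(weights)
    x = list(history)
    M = len(w)
    errors = 0
    for bit in series:
        guess = sign(sum(wi * xi for wi, xi in zip(w, x)))
        if guess != bit:
            errors += 1
        w = [wi + (eta / M) * bit * xi for wi, xi in zip(w, x)]   # Hebbian step
        x = [bit] + x[:-1]
    return errors


rng = random.Random(1)
M = 10
w0 = [rng.gauss(0.0, 0.2) for _ in range(M)]
x0 = [rng.choice((-1, 1)) for _ in range(M)]
bits, w_final = confused_bit_generator(w0, x0, steps=500)
print(hebbian_prediction_errors(w0, x0, bits), "errors out of", len(bits))
```

A predictor that starts from different initial weights is not guaranteed to fail every time; the sketch only covers the exact-copy case.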



2.2.3 Dynamics of the weight vector 

Geometrically speaking, $\mathbf{w}$ does a random-walk-like movement on an
$M$-dimensional cubic lattice: in each component of the weight vector, the
learning step has a value of $\pm\eta/M$. Although the weight components are real
numbers, once an initial state $\mathbf{w}^0$ has been defined, the components
$w_i$ can only take values $w_i^0 \pm n\eta/M$, with $n \in \mathbb{Z}$.

Each learning step has a negative overlap with the current $\mathbf{w}$. The norm
of the weight vector fluctuates around an equilibrium value that can be estimated
by taking the square of Eq. (2.5) and applying the usual formalism for online
learning [29, 31]. Since the patterns are windows of the time series and are
hence generated by a complex dynamical process, the required averages over
scalar products between the weight and the pattern cannot be done exactly.
However, the generated time series looks sufficiently irregular at first glance
(no prominent frequency, no short-term repetitions) that one can try replacing
$\mathbf{x}$ with a random vector whose components are independent random
variables of mean 0 and variance 1, and see how far that approximation carries:

$$\left\langle \mathbf{w}^{t+1}\cdot\mathbf{w}^{t+1} - \mathbf{w}^{t}\cdot\mathbf{w}^{t} \right\rangle = -\frac{2\eta}{M}\left\langle \mathbf{x}^{t}\cdot\mathbf{w}^{t}\,\mathrm{sign}(\mathbf{x}^{t}\cdot\mathbf{w}^{t}) \right\rangle + \frac{\eta^2}{M^2}\left\langle \mathbf{x}^{t}\cdot\mathbf{x}^{t} \right\rangle. \tag{2.6}$$

Introducing a time scale $\alpha$ with $d\alpha = 1/M$ and averaging over
$\mathbf{x}$, this becomes a deterministic differential equation for the norm
$w = |\mathbf{w}|$ in the thermodynamic limit $M \to \infty$:^3

$$\frac{dw}{d\alpha} = -\sqrt{\frac{2}{\pi}}\,\eta + \frac{\eta^2}{2w}. \tag{2.7}$$

The attractive fixed point of this equation is
$w = \sqrt{\pi/8}\,\eta \approx 0.6267\,\eta$. The learning rate $\eta$ thus only
sets a length scale, but does not influence the behavior of the system once the
weight vector has reached its fixed-point length.

^3 This limit will be tacitly assumed every time a differential equation is
written down; it usually gives good results even for moderate $M$ (on the order
of $M \approx 20$).

However, using the time series generated by the perceptron as patterns,
simulations give a slightly different value of $w \approx 0.566\,\eta$,
independent of $M$ (this was already observed in [11]). Two possible mechanisms
for this deviation suggest themselves: first, the time series windows generated
by the CBG are not isotropically distributed on the $M$-dimensional hypercube;
some patterns are more likely to appear than others (see Ref. [21] for details).
Second, correlations between the outputs and the patterns might be the cause of
the deviations. This can be checked by using patterns drawn randomly from an
anisotropic distribution that resembles the one generated by the CBG. Since this
leads to values of $w$ compatible with the result for random vectors, one must
conclude that correlations between outputs and patterns are responsible for the
deviation.
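
The comparison between random patterns and fed-back patterns can be reproduced qualitatively with a short simulation. This is a rough sketch, not the simulation setup of the cited references: the system size $M = 30$, the run length, the burn-in fraction and the function name `mean_weight_norm` are arbitrary illustrative choices, and the resulting averages carry statistical and finite-size errors of a few percent.

```python
import math
import random


def sign(h: float) -> int:
    return 1 if h >= 0 else -1   # tie-breaking at h == 0 is an assumption


def mean_weight_norm(M, steps, feedback, eta=1.0, seed=1):
    """Time-averaged |w| for a perceptron that always learns the opposite of its output.

    feedback=True  : patterns are windows of the generated series (the CBG).
    feedback=False : patterns are fresh random +/-1 vectors (the approximation).
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.5 / math.sqrt(M)) for _ in range(M)]
    x = [rng.choice((-1, 1)) for _ in range(M)]
    burn_in = steps // 4          # discard the initial relaxation of the norm
    total = 0.0
    for t in range(steps):
        bit = -sign(sum(wi * xi for wi, xi in zip(w, x)))
        w = [wi + (eta / M) * bit * xi for wi, xi in zip(w, x)]
        if feedback:
            x = [bit] + x[:-1]
        else:
            x = [rng.choice((-1, 1)) for _ in range(M)]
        if t >= burn_in:
            total += math.sqrt(sum(wi * wi for wi in w))
    return total / (steps - burn_in)


theory = math.sqrt(math.pi / 8)   # fixed point of the norm for eta = 1
random_norm = mean_weight_norm(30, 12000, feedback=False)
cbg_norm = mean_weight_norm(30, 12000, feedback=True)
print(f"random-pattern theory: {theory:.4f}")
print(f"random patterns:       {random_norm:.4f}")
print(f"CBG (fed-back bits):   {cbg_norm:.4f}")
```

With random patterns the average should land close to $\sqrt{\pi/8}\,\eta$; the text above reports a somewhat smaller value for the fed-back case.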

Using the approximation of random patterns, the autocorrelation of the weight
vector can be estimated as well:

$$\left\langle \mathbf{w}^{t}\cdot\mathbf{w}^{t+\tau} \right\rangle = \frac{\pi\eta^2}{8}\exp\left(-\frac{4}{\pi}\frac{\tau}{M}\right). \tag{2.8}$$

Of course, this does not reflect the fact that the CBG is a deterministic system
with a bounded number of states, so cycles are inevitable, and the weight vector
must return to its original point later (i.e., the autocorrelation of the weights
will return to its initial value at multiples of the return time). The dynamics
of the weights can be linked to the autocorrelation function $C_j^t$ of the
sequence, defined by

$$C_j^t = \sum_{i=1}^{t} x^{i} x^{i-j}, \tag{2.9}$$

where $t$ is the number of patterns summed over. Simply add $t$ update steps
according to Eq. (2.5):

$$w_j^t = w_j^0 + \sum_{i=1}^{t} (\eta/M)\, x^{i} x^{i-j} = w_j^0 + (\eta/M)\, C_j^t. \tag{2.10}$$

Each value $C_j^t$ for $1 \le j \le M$ corresponds to the distance of the weight
vector from its starting point along one axis in the $M$-dimensional weight
space, measured in units of $\eta/M$. This point is important and will be
exploited in the following paragraphs. Note that Eq. (2.10) holds for any
perceptron that learns a time series following the Hebb rule, regardless of
whether the time series is predictable, antipredictable or anything in between.
This means that in the long run, the perceptron is only sensitive to the first
$M$ values of the autocorrelation function of the considered time series. The
stronger the autocorrelation of a sequence is, the more accurate the prediction
will be.

In the CBG, the norm of the weight vector, and therefore the autocorrelation
function of the generated sequence, is suppressed: as described, the norm of
$\mathbf{w}$ stays bounded. Following Eq. (2.10), each component of the
autocorrelation function stays bounded as well, instead of growing with
$\sqrt{\alpha}$ as it would for a random sequence.
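
The link between the weight displacement and the autocorrelation function is an exact bookkeeping identity, so it can be checked numerically for every lag $1 \le j \le M$. The sketch below assumes the index convention that component $j$ of the pattern at time $t$ is $x^{t-j+1}$; the function names and parameter values are illustrative.

```python
import random


def sign(h: float) -> int:
    return 1 if h >= 0 else -1   # tie-breaking at h == 0 is an assumption


def run_cbg(w0, history, steps, eta=1.0):
    """Run the CBG; history[k] holds x^{-k}. Returns final weights and the full
    series x^{1-M}, ..., x^{0}, x^{1}, ..., x^{steps} (oldest value first)."""
    M = len(w0)
    w = list(w0)
    x = list(history)                      # x[0] is the most recent bit
    full = list(reversed(history))
    for _ in range(steps):
        bit = -sign(sum(wi * xi for wi, xi in zip(w, x)))
        w = [wi + (eta / M) * bit * xi for wi, xi in zip(w, x)]
        x = [bit] + x[:-1]
        full.append(bit)
    return w, full


def autocorrelation(full, M, t, j):
    """C_j^t = sum_{i=1}^{t} x^i x^{i-j}; x^i is stored at full[i + M - 1]."""
    offset = M - 1
    return sum(full[i + offset] * full[i - j + offset] for i in range(1, t + 1))


rng = random.Random(2)
M, t, eta = 10, 2000, 1.0
w0 = [rng.gauss(0.0, 0.1) for _ in range(M)]
history = [rng.choice((-1, 1)) for _ in range(M)]
w, full = run_cbg(w0, history, t, eta)

# Largest deviation between the weight displacement and (eta/M) * C_j^t.
max_gap = max(
    abs((w[j - 1] - w0[j - 1]) - (eta / M) * autocorrelation(full, M, t, j))
    for j in range(1, M + 1)
)
print("largest deviation:", max_gap)
```

The deviation is zero up to floating-point rounding, whatever sequence is being learned; replacing the generated bit by any other teacher signal in `run_cbg` leaves the identity intact.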
Figure 2.4: Autocorrelation function of the time series generated by the CBG,
averaged over i = 2 x 10^ for M = 50 (taken from Ref. [21]). The first M 
components are suppressed by the dynamics of the weights. 

2.2.4 The Bernasconi model 

Bit series with low autocorrelations are of interest in mathematics and have
applications in signal processing [32]. On the one hand, this makes it possible
to apply findings on these sequences to the CBG. On the other hand, it is
interesting to ask whether the CBG generates sequences with autocorrelations
significantly lower than for random series. Two measures that indicate low
autocorrelations are commonly used in the literature: for periodic sequences of
length $l$, an energy function (which is studied in the so-called Bernasconi
model for periodic boundary conditions [33]) can be defined by

$$H_p = \sum_{j=1}^{l-1} (C_j)^2 = \sum_{j=1}^{l-1} \left( \sum_{i=1}^{l} x^{i} x^{i-j} \right)^2, \tag{2.11}$$

where the index $i - j$ is taken modulo $l$. Results on the ground states of this
Hamiltonian can be found in [34]. By trial and error, initial conditions for the
CBG can be found which yield cycles slightly longer than $2M$, for which all
values of $C_j$ except one are 0. However, even for the best sequences we found
in long simulation runs, $H_p$ was larger than the known ground state energies by
at least a factor of 2.

The original model does not use periodic boundary conditions: in a sequence
of length $p$, only the sum over $p - j$ different terms with a lag of $j$ can be
calculated. The energy for aperiodic sequences is therefore given by

$$H_{ap} = \sum_{j=1}^{p-1} (C_j^p)^2 = \sum_{j=1}^{p-1} \left( \sum_{i=j+1}^{p} x^{i} x^{i-j} \right)^2 \tag{2.12}$$

(note the summation limits). The so-called merit factor $F$ introduced by Golay
[35] is defined by

$$F = \frac{p^2}{2 H_{ap}}. \tag{2.13}$$

A merit factor of 1 is expected for a random sequence; lower autocorrelations
yield higher $F$. The theoretical limit for large $p$ is conjectured to be about
$F = 12$ [33, 36], whereas optimization routines such as simulated annealing
typically find sequences with $5 < F < 9$ (see [37] and references therein), and
exact enumeration for $p < 60$ suggests $\lim_{p \to \infty} F = 9.3$ for the
optimal sequence [38].
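
The aperiodic energy and the merit factor are straightforward to compute for any $\pm 1$ sequence. The sketch below uses the length-13 Barker sequence as a test case; that sequence is a standard example from the signal-processing literature and is not taken from this text, and the function names are illustrative.

```python
def aperiodic_autocorrelation(seq, lag):
    """Sum of the p - lag products x^i x^{i-lag} available in a finite sequence."""
    return sum(seq[i] * seq[i - lag] for i in range(lag, len(seq)))


def aperiodic_energy(seq):
    """Sum of squared aperiodic autocorrelations over all lags 1 .. p-1."""
    return sum(aperiodic_autocorrelation(seq, j) ** 2 for j in range(1, len(seq)))


def merit_factor(seq):
    """Golay's merit factor F = p^2 / (2 * H_ap)."""
    return len(seq) ** 2 / (2 * aperiodic_energy(seq))


barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
print(aperiodic_energy(barker13))   # 6
print(merit_factor(barker13))       # 169 / 12, about 14.08
```

The Barker sequence has autocorrelations of magnitude at most 1 at every lag, which is why its merit factor lies far above the value of about 1 expected for a random sequence.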

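To analytically estimate the merit factor of sequences generated by the CBG,
one can solve Eq. (2.10) for $C_j^p$ and use the autocorrelation of the weights
given by Eq. (2.8):

$$\left\langle (C_j^p)^2 \right\rangle = \frac{M^2}{\eta^2}\left\langle (w_j^p)^2 + (w_j^0)^2 - 2 w_j^p w_j^0 \right\rangle = \frac{\pi M}{4}\left[ 1 - \exp\left( -\frac{4}{\pi}\frac{p}{M} \right) \right]. \tag{2.14}$$

The energy can be expressed as a sum or approximated by an integral in
continuous variables $\alpha = p/M$ and $\beta = j/M$. In an aperiodic sequence
only $p - j$ terms contribute to lag $j$, so $p$ is replaced by $p - j$ in
Eq. (2.14). Since Eq. (2.14) only holds for $1 \le j \le M$,
$\langle (C_j^p)^2 \rangle = p - j$ must be used for $j > M$. One gets the
expression

$$H_{ap} = \sum_{j=1}^{p-1} \frac{\pi M}{4}\left[ 1 - \exp\left( -\frac{4}{\pi}\frac{p-j}{M} \right) \right] \approx M^2 \int_0^{\alpha} \frac{\pi}{4}\left( 1 - \exp\left( -(4/\pi)(\alpha - \beta) \right) \right) d\beta = M^2\,\frac{\pi}{4}\left[ \alpha - \frac{\pi}{4}\left( 1 - \exp\left( -\frac{4}{\pi}\alpha \right) \right) \right] \tag{2.15}$$

for $p \le M$ and

$$H_{ap} = M^2 \left\{ \frac{\pi}{4}\left[ 1 - \frac{\pi}{4}\left( \exp\left( -\frac{4}{\pi}(\alpha - 1) \right) - \exp\left( -\frac{4}{\pi}\alpha \right) \right) \right] + \frac{(\alpha - 1)^2}{2} \right\} \tag{2.16}$$

for $p > M$.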

The corresponding merit factor is compared to simulations in Fig. 2.5: Eqs. 
(2.15) and (2.16) give qualitatively correct results, but differ from the observed 
values by roughly 10%. The feedback mechanisms of the CBG cause a faster 
decay of Cj than predicted for random patterns. 
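
The random-pattern estimate can be evaluated numerically. The sketch below is built on assumptions that are spelled out in the code comments: lags $j \le M$ contribute $(\pi M/4)\,[1 - \exp(-(4/\pi)(p-j)/M)]$ to the energy, lags $j > M$ contribute $p - j$, and sums are replaced by integrals in $\alpha = p/M$. The function names are illustrative, and the closed forms are a reconstruction under these assumptions rather than a verbatim copy of the printed equations.

```python
import math


def scaled_energy(alpha: float) -> float:
    """H_ap / M^2 in the random-pattern approximation (assumed closed forms).

    Suppressed lags (j <= M):   integrand (pi/4) * (1 - exp(-(4/pi)(alpha - beta)))
    Unsuppressed lags (j > M):  integrand (alpha - beta), as for a random sequence
    """
    a = math.pi / 4
    if alpha <= 1.0:
        # All lags are suppressed: integrate beta from 0 to alpha.
        return a * (alpha - a * (1.0 - math.exp(-alpha / a)))
    # Suppressed part: beta from 0 to 1; unsuppressed part: beta from 1 to alpha.
    suppressed = a * (1.0 - a * (math.exp(-(alpha - 1.0) / a) - math.exp(-alpha / a)))
    unsuppressed = (alpha - 1.0) ** 2 / 2.0
    return suppressed + unsuppressed


def merit_factor_theory(alpha: float) -> float:
    """F = p^2 / (2 H_ap) with p = alpha * M; the factor M^2 cancels."""
    return alpha ** 2 / (2.0 * scaled_energy(alpha))


for alpha in (0.5, 1.0, 2.0, 3.0, 10.0):
    print(f"alpha = {alpha:5.1f}   F = {merit_factor_theory(alpha):.3f}")
```

Under these assumptions the estimate stays below 2 for all sequence lengths and approaches the random-sequence value $F = 1$ for very short and very long sequences.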

Figure 2.5: Merit factor F of sequences generated by the CBG, as a function
of scaled sequence length p/M, compared to Eqs. (2.15) and (2.16). Error bars
denote the standard deviation of F for M = 100, not the standard error.

A simple extension of the dynamics makes it possible to manipulate this result
slightly: if each weight component gets an individual (positive) learning rate,
one can write down an update equation for each component that separates the
influence of that component on the output from all others. A short calculation
(again under the assumption of random patterns, which has yielded qualitatively
correct predictions so far) then shows that the mean square value of that weight
is proportional to its learning rate:




$$\langle w_j^2 \rangle = \frac{\pi\, \eta_j \sum_i \eta_i}{8 M^2}. \tag{2.17}$$



Although Eq. (2.17) looks like a regular on-line learning equation, it does not
carry quite as far: $w_j^2$ is not a self-averaging quantity [39], i.e., the variance of
$w_j^2$ does not go to zero in the thermodynamic limit, and Eq. (2.17) only makes a
statement about the long-time average.

What it does state is that a component with a higher learning rate has, on 
the average, a larger weight and thus a stronger influence on the output. This 
also leads to a faster decay of the corresponding segment of the autocorrelation: 



TT 



exp 



(2.18) 



Returning to the Bernasconi problem, the search for a minimal $H_{ap}$ can be written
as an optimization problem in the continuous function $\eta(\beta)$, where $\eta_j = \eta(j/M)$.
Solving this problem with a variational approach, one finds that it is sensible to
give the last 41% of the weights a learning rate and norm of zero, and increase
the learning rate continuously towards components with smaller indices. Unfortunately,
even this optimization does not improve the merit factor beyond F = 1.74
in theory and $\langle F \rangle = 1.86$ in simulations. This is still a lot worse than the results
of other optimization methods [33, 37], so the CBG is not a competitive generator
of low-autocorrelation sequences.

The limited use of the CBG in generating sequences with a high merit factor 
may be related to phase space arguments: as seen in Sec. 2.2.6, the CBG can 
still generate exponentially many different time series depending on initial
conditions, whereas there are very few sequences with the highest achievable
merit factors (see [40] for the density of states with cyclic boundary conditions).
The mechanism of the CBG allows for manipulation of the autocorrelation func- 
tion only if the constraints on the desired sequence are not too strong, such as 
suppressing all of the elements of Cj on a short time scale. On the other hand, 
choosing a given shape for long-time averages of Cj still allows for many realiza- 
tions of the sequence. 

Although it is not very useful in a straight application to the Bernasconi 
problem, the Confused Bit Generator still has some interesting possibilities: 



2.2.5 Shaping the autocorrelation function 

In some cases, it may be interesting to generate a time series whose power spectrum
has a specific shape in the long-time limit. Using Eqs. (2.17) and (2.10) in
the limit where $w_j^0 \approx 0$ and with non-negative learning rates, one can obtain the
inverse relation between the square of the autocorrelation function $C_j^2$ and the
corresponding learning rate $\eta_j$:

$$\langle C_j^2 \rangle = \frac{\pi}{8} \frac{\sum_i \eta_i}{\eta_j}. \tag{2.19}$$

Thus, just about any desired shape of the square autocorrelation function is
achievable by using the appropriate profile for $\eta_j$, which can be extracted from
Eq. (2.19).

If one is interested not only in the shape but also in the norm of the autocorrelation
function, one can influence this by distorting the output with multiplicative
noise: if there is a probability of $p_f$ for flipping the output,

$$x_1^{t+1} = \begin{cases} \mathrm{sign}(\mathbf{x}\cdot\mathbf{w}) & \text{with probability } p_f, \\ -\mathrm{sign}(\mathbf{x}\cdot\mathbf{w}) & \text{with probability } 1 - p_f, \end{cases} \tag{2.20}$$

the learning step has a positive overlap with the weight with probability $p_f$,
leading to a larger norm w in the stationary state. For homogeneous learning
rates, Eq. (2.7) now reads

$$\frac{dw}{d\alpha} = \sqrt{\frac{2}{\pi}}\, \eta\, (2 p_f - 1) + \frac{\eta^2}{2w}, \tag{2.21}$$

and the fixed point (which only exists for $p_f < 1/2$) is correspondingly

$$w = \eta \sqrt{\frac{\pi}{8}}\, \frac{1}{1 - 2 p_f}. \tag{2.22}$$

However, Eq. (2.10) still holds, and $\langle C_j^2 \rangle$ increases accordingly. The same factor
of $1/(1 - 2p_f)$ also appears in Eq. (2.19) if the calculation is repeated for individual
learning rates.

If it is desired to enhance certain correlations rather than suppress them to
a lesser or greater degree, it is possible to give some components a learning rate
of 0 and a fixed norm. The sign of that weight component determines whether
the correlation is positive or negative; the norm determines the strength of the
correlations.

Does all this work in practice? Within certain bounds, it does. However, 
the autocorrelation function of a sequence cannot be shaped arbitrarily [41], and 
imposing strong constraints on some components of C leads to a strong violation 
of the assumption of random patterns that underlies the calculations. The results 
are not easy to calculate, so some trial and error is required to get the desired
result.

Furthermore, for each individual set of initial conditions, the autocorrelation 
function will show significant deviations from the calculated profile, and good 
agreement is only reached when averaging over sufficiently many initial condi- 
tions. Fig. 2.6 shows two examples of autocorrelation functions shaped with 
exponential and sinusoidal profiles, together with the theoretical profile (the pro- 
portionality constant was fitted). 

In conclusion, simulations show that, with some tuning, the CBG could be
used as an alternative mechanism for generating colored binary sequences using
local rules instead of nonlocal mechanisms such as Fourier transforms.
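The procedure of this section can be sketched in a few lines. The code below is a minimal illustration, not the simulation used for Fig. 2.6: it assumes the CBG dynamics in the form "new bit = opposite of the perceptron's prediction, followed by a Hebbian step on the new bit", with an individual learning rate per component, and it assumes small random initial weights. All function names are hypothetical.

```python
import numpy as np

def learning_rates_for(target_c2):
    """Profile eta_j proportional to 1 / <C_j^2> (the inverse relation of this section)."""
    eta = 1.0 / np.asarray(target_c2, dtype=float)
    return eta / eta.mean()            # overall scale chosen arbitrarily

def run_cbg(eta, steps, p_flip=0.0, seed=0):
    """Simulate a CBG with per-component learning rates eta[j] (assumed dynamics)."""
    rng = np.random.default_rng(seed)
    eta = np.asarray(eta, dtype=float)
    M = len(eta)
    x = rng.choice([-1.0, 1.0], size=M)      # x[j-1] is the bit generated j steps ago
    w0 = rng.normal(scale=1e-3, size=M)      # small initial weights (assumption)
    w, x0 = w0.copy(), x.copy()
    bits = np.empty(steps)
    for t in range(steps):
        s = -1.0 if np.dot(w, x) >= 0.0 else 1.0   # contradict the prediction
        if rng.random() < p_flip:                  # optional output noise
            s = -s
        w += (eta / M) * s * x                     # Hebbian step on the generated bit
        x = np.roll(x, 1)
        x[0] = s
        bits[t] = s
    return bits, x0, w0, w

def window_autocorrelation(bits, x0):
    """C_j = sum_t s^{t+1} s^{t+1-j} for j = 1..M, including the initial window."""
    M, p = len(x0), len(bits)
    full = np.concatenate([x0[::-1], bits])        # chronological order
    return np.array([np.dot(full[M:], full[M - j:M - j + p]) for j in range(1, M + 1)])

M = 100
j = np.arange(1, M + 1)
eta = learning_rates_for(np.exp(2.0 * j / M))      # target <C_j^2> growing with j
bits, x0, w0, w = run_cbg(eta, steps=1000)
C = window_autocorrelation(bits, x0)
print(np.allclose(w - w0, eta / M * C))            # weight change mirrors C_j
```

With this update rule the change of each weight equals $(\eta_j/M)\, C_j$ by construction, which is why bounded weights force the first M autocorrelations to stay small. As noted in the text, a single run fluctuates strongly, so a comparison with a target profile requires averaging $C_j^2$ over many initial conditions.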

2.2.6 Distribution of cycles 

Just like the BG with fixed weights, the CBG is a deterministic system with a
finite number of states, and thus has to fall into a cycle eventually. The distribution
of cycle lengths l was studied in some detail in my diploma thesis [21]. For
completeness, I will repeat the essential results here.

The connection between the autocorrelation function and the weights that
was exploited in the previous sections can be used to give a lower bound on
cycle lengths of the CBG: since the weight has to return to its original position
after l steps, the first M components of the autocorrelation function $C_j$ have to
vanish (see Eq. (2.10)). By renaming the indices, one can see that the last M
components $C_{l-j}$ must vanish as well. That would mean that for l < 2M, all
components of $C_j$ would be zero. There is good mathematical evidence [34] that
this is not possible for M > 3. Furthermore, to get $C_j = 0$ for any j, l must be
divisible by 4.

Figure 2.6: Mean square autocorrelation function after t = 1000 steps, generated
with a CBG with M = 100 and the learning rate profiles $\eta_j = \exp(-2j/M)$ (plot
(a)) and $\eta_j = 1/(0.01 + |\sin(2\pi j/M)|)$ (plot (b)). Dashed lines show the profile
predicted from Eq. (2.19), with a proportionality constant fitted to the data.
Simulations averaged over 1000 runs.

Extensive simulations indeed show that the lower bound $l \ge 2M$ holds (with
the exception of M = 2). However, the shortest cycle lengths that are observed
are the smallest multiples of 4 larger than 2M. The average cycle length grows
exponentially with M: $\langle l \rangle \propto 2.2^M$.



2.2.7 Summary of the CBG 

The Confused Bit Generator generates the time series that is antipredictable for a 
perceptron, and it is hardly surprising that the properties of that sequence differ 
from one that is easily predictable: while a perceptron with Hebbian learning is 
sensitive to the autocorrelation of a time series, the CBG suppresses the first M 
bits of the autocorrelation function - as many as it can, with a memory of M 
time steps. 

This suppression can be linked directly to the dynamics of the weights. These
can in turn be calculated under the assumption of random patterns, which allows
one to estimate quantities like the merit factor of the generated sequence.
Unfortunately, the CBG is not a very good source of low-autocorrelation sequences;
however, individual learning rates for different components allow for a manipulation
and customized shaping of the autocorrelation function.

The suppression of the autocorrelation function points to another difference
between the CBG and the BG: the sequence generated by the CBG no longer
has a dominant Fourier component. The Fourier transform of the weight vector
is no longer a very meaningful quantity either, since it changes constantly as the
vector moves.

The CBG has a significantly larger number of accessible states than the BG
with equal M. However, while the average cycle length of the BG grows polynomially
in M, and has a typical value of 2M, $\langle l \rangle$ grows exponentially for the CBG,
and has 2M as a lower bound.

2.3 Continuous Perceptrons: the Confused Sequence Generator

2.3.1 Continuous perceptrons with fixed and adaptive weights

The perceptron with binary output can readily be generalized to continuous output
by replacing the sign function with a sigmoid function, such as the error
function. This introduces a new parameter, the amplification $\beta$, so that the
output for a pattern x and a weight vector w becomes

$$\mathrm{erf}(\beta\, \mathbf{x}\cdot\mathbf{w}). \tag{2.23}$$

Just like the simple perceptron, the continuous perceptron can be used to generate
a time series:

$$x^{t} = \mathrm{erf}\left( \beta \sum_{i=1}^{M} w_i x^{t-i} \right). \tag{2.24}$$

This system (the Sequence Generator or SGen) was studied in Refs. [23, 42, 43].
In the typical case, depending on the amplification, the system either converges
to the trivial solution x = 0 or, after a short chaotic transient, relaxes into a
stable limit cycle (see Fig. 2.7).

A careful investigation shows that there are small areas in parameter space
(which is spanned by w and $\beta$) that lead to chaotic behavior. Since these areas
are themselves fractals, and the appearance of chaos is therefore sensitive to small
variations in the parameters, the term fragile chaos was coined to characterize
the system [23].

Figure 2.7: Typical return map of an SGen: after a short transient (scattered
points) the system enters a quasi-periodic limit cycle.

It is not quite obvious how to generalize the Confused Bit Generator to continuous
output: there is no unique value of the prediction that is "wrong", since
any value that is different from the actual value of x is more or less wrong. The
following simple prescription was chosen to keep the dynamics as close as possible
to that of the CBG, and especially to keep the connection between the
weight vector and the autocorrelation function of the generated time series. The
continuation of the time series is thus simply taken to be the negative of the
perceptron's output, and the learning step is proportional to the desired output
(and therefore, to the difference between the desired and the real output):

$$x_1^{t+1} = -\mathrm{erf}(\beta\, \mathbf{x}^t\cdot\mathbf{w}^t); \tag{2.25}$$
$$\mathbf{w}^{t+1} = \mathbf{w}^t + (\eta/M)\, x_1^{t+1}\, \mathbf{x}^t. \tag{2.26}$$

This system was introduced in Ref. [20], where it was named Confused Sequence
Generator (CSG).
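One step of this map can be written down directly. The following is a minimal sketch under the convention that the newest output is stored in the first component of the pattern vector and the oldest component is dropped; the function name and array layout are illustrative choices.

```python
import math
import numpy as np

def csg_step(x, w, beta, eta):
    """One step of the Confused Sequence Generator, following Eqs. (2.25) and (2.26)."""
    M = len(x)
    s = -math.erf(beta * float(np.dot(x, w)))   # negative of the perceptron's output
    w_new = w + (eta / M) * s * x               # learning step proportional to the new output
    x_new = np.roll(x, 1)                       # shift the sequence window by one position
    x_new[0] = s                                # newest value enters as the first component
    return x_new, w_new

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=20)
w = rng.normal(size=20)
for _ in range(5):
    x, w = csg_step(x, w, beta=8.0, eta=1.0)
    print(x[0])
```

Because the output is bounded by the error function, every generated value lies in [-1, 1], and a window of zeros is mapped onto itself with unchanged weights.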

2.3.2 Mean-field solution 

Similar to the CBG, the weight vector of the CSG does a directed pseudo-random 
walk near the surface of a hypersphere of a radius w. Unlike the CBG, the 
length of the learning steps, which determines w, is not fixed, but depends on the 
magnitude of the output, which in turn depends on w and the outputs in previous 
time steps. To find an approximate solution to this self-consistency problem, I 
will first ignore correlations between patterns and weights and treat the patterns 
as random and independent. 

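In this approach, the hidden field $h = \mathbf{w}\cdot\mathbf{x}$ is a Gaussian random variable of
mean 0 and variance $w^2 S^2$, where $S^2 = \langle (x_1^t)^2 \rangle_t$ is the mean square output of the
system.

The norm w is found by taking the square of (2.26):

$$(w^{t+1})^2 = (w^t)^2 - \frac{2\eta}{M}\, \mathbf{x}^t\cdot\mathbf{w}^t\, \mathrm{erf}(\beta\, \mathbf{x}^t\cdot\mathbf{w}^t) + \frac{\eta^2}{M^2}\, (x_1^{t+1})^2\, \mathbf{x}^t\cdot\mathbf{x}^t, \tag{2.27}$$

and averaging over the input patterns. The self-overlap $\mathbf{x}\cdot\mathbf{x}$ is on the average
$M S^2$, so the fixed point of w is given by

$$2 \langle h\, \mathrm{erf}(\beta h) \rangle = \eta S^4. \tag{2.28}$$

The average on the left hand side can be evaluated and leads to

$$\frac{4 \beta w^2 S^2}{\sqrt{\pi (1 + 2 \beta^2 w^2 S^2)}} = \eta S^4, \quad \text{or} \tag{2.29}$$

$$w^2 = \frac{\eta S^2}{16 \beta} \left( \pi \eta \beta S^4 + \sqrt{\pi^2 \eta^2 \beta^2 S^8 + 16 \pi} \right). \tag{2.30}$$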

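Let us now turn to $S^2$. The probability distribution of the output itself is rather
awkward, since it involves inverse error functions, and its slope diverges at
±1. However, $S^2$ can be easily calculated by using the distribution of h:

$$S^2 = \int \frac{dh}{\sqrt{2\pi}\, w S} \exp\left( -\frac{h^2}{2 w^2 S^2} \right) \mathrm{erf}^2(\beta h)
= \frac{2}{\pi} \arcsin\left( \frac{2 \beta^2 w^2 S^2}{1 + 2 \beta^2 w^2 S^2} \right). \tag{2.31}$$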


Plugging $w^2(\eta, S^2)$ from Eq. (2.30) into (2.31) and solving numerically, one
obtains a self-consistent solution for $S^2$. A closer look at the equations reveals
that if a new quantity $\gamma = \eta\beta$ is introduced, only $\gamma$ enters into the equation for
$S^2$, and $w^2$ is of the form $w^2 = \eta^2 \tilde{w}^2(\gamma)$, so only one curve must be considered.
This is intuitive, since a higher $\eta$ eventually leads to a higher w, which has the
same effect on $S^2$ as having a smaller w, but multiplying $\mathbf{w}\cdot\mathbf{x}$ with a higher factor
$\beta$.

The map defined by Eqs. (2.25) and (2.26) always has the trivial solution
x = 0. Only for a sufficiently high $\gamma > \gamma_c$ are the outputs high enough to sustain
a non-vanishing solution. Note that x = 0 is always an attractive solution for all
$\gamma < \infty$, but its basin of attraction becomes smaller for larger $\gamma$.

The numerical solution of Eqs. (2.30) and (2.31) shows that the system undergoes
a saddle-node bifurcation at $\gamma_c \approx 5.785$, which is in good agreement
with simulations. Above $\gamma_c$, two new fixed points exist, only one of which is
stable. While for $S^2(\gamma)$ excellent agreement is found between theory and simulation
(see Fig. 2.8), $w^2(\gamma)$ shows quantitative differences which are caused by
correlations between x and w: the mean square overlap $\langle (\mathbf{x}\cdot\mathbf{w})^2 \rangle$ turns out to be
$(1.22 \pm 0.01)\, w^2 S^2$ instead of $w^2 S^2$ as expected for random patterns. This causes
a factor of roughly 0.82 between the theoretical and observed value of $w^2$, as
seen in Fig. 2.8. It is the same factor that is found in the CBG, and likely
caused by the same mechanisms.

Figure 2.8: Mean-field solution of the CSG (Eqs. (2.30) and (2.31)), compared
to simulations with M = 400.

For large $\gamma$, $S^2$ goes to 1 (as it should, since the system is identical to the
CBG if $\gamma = \infty$), and the theoretical prediction for w goes to $\sqrt{\pi/8}\,\eta$, just like in
the CBG.
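The self-consistent solution can be explored numerically with a few lines of code. The sketch below assumes the fixed-point relations in the form $2\langle h\,\mathrm{erf}(\beta h)\rangle = \eta S^4$ and $S^2 = (2/\pi) \arcsin[2u/(1+2u)]$ with $u = \beta^2 w^2 S^2$, which depend on $\eta$ and $\beta$ only through $\gamma = \eta\beta$. The function names and the grid-plus-bisection strategy are illustrative choices, not the method used in the thesis.

```python
import math

def s2_update(s2, gamma):
    """Right-hand side of the self-consistency condition for S^2 at gamma = eta * beta."""
    a = math.pi * gamma * s2 ** 2
    # u = beta^2 w^2 S^2, with w^2 eliminated via the fixed-point condition of the norm
    u = gamma * s2 ** 2 * (a + math.sqrt(a * a + 16.0 * math.pi)) / 16.0
    return (2.0 / math.pi) * math.asin(2.0 * u / (1.0 + 2.0 * u))

def max_margin(gamma, n=2000):
    """Largest value of s2_update(q) - q on a grid; positive means a nonzero solution exists."""
    return max(s2_update(i / n, gamma) - i / n for i in range(1, n + 1))

def critical_gamma(lo=5.0, hi=7.0, tol=1e-4):
    """Bisection for the smallest gamma that admits a non-vanishing solution."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if max_margin(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

print(critical_gamma())
```

Below the critical value the curve `s2_update(q)` stays under the diagonal for all q > 0, so only the trivial solution survives; above it, two intersections appear, as expected for a saddle-node bifurcation.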

2.3.3 Autocorrelation function 

The relation Eq. (2.10) that links the autocorrelation function to the weights in
the CBG still holds for the CSG. Since the weight vector is bounded in the CSG as
well, the same argument can be given for the suppression of the first M values of
the autocorrelation function. Correspondingly, $C_j/p$ is almost indistinguishable
from that of the CBG shown in Fig. 2.4.

In principle, this allows one to play the same games as in Sec. 2.2.5 to shape the
first M components of $C_j$. However, the practical need for a continuous time
series with a rather odd probability distribution and a custom-tailored autocorrelation
function is probably small.

2.3.4 Cycles and attractors

The CSG can be seen as a nonlinear mapping that maps the vector $\mathbf{x}^t \oplus \mathbf{w}^t$
onto $\mathbf{x}^{t+1} \oplus \mathbf{w}^{t+1}$. Again, this is opposed to the SGen with static weights, where
the components of the sequence window are the only dynamic variables, and in
analogy to the difference between the Bit Generator and the CBG. The only
relevant control parameter of this mapping is $\gamma$.

Since both the sequence and the weights now live in a high-dimensional space
of real numbers, the CSG can display a wide variety of behaviors, depending on
M and $\gamma$:

For $\gamma < \gamma_c$, the zero solution is the only attractor, and the system will quickly
reach $\mathbf{x}^t = 0$ and stop developing.

For $\gamma$ slightly above $\gamma_c$, an irregular-looking time series with the distribution
and variance that were calculated in Section 2.3.2 and displayed in Fig. 2.8 is
generated. However, the zero solution is still attractive, and after some time the
system will drift close to it and stay there, i.e. the irregular behavior represents
a chaotic transient rather than a stable chaotic attractor.

The survival time on the transient increases dramatically with increasing M
and $\gamma$. It is hard to decide from numerical results whether the average survival
time $\langle t_s \rangle$ diverges with a power law at some finite critical value of $\gamma$, as one
usually finds in scenarios where a chaotic transient becomes a chaotic attractor [44],
or whether $\langle t_s \rangle$ increases exponentially with $\gamma$. In either case, the system shows chaotic
behavior for sufficiently long times to get stable numerical results - for example,
for M = 20 and $\gamma$ = 7.0, the average survival time is on the order of 10^ steps.

If $\gamma$ is larger than some critical value that depends on M, the chaotic transient
can eventually end in a cycle that is related to a possible cycle of the discrete
CBG. "Related" means that $x^t$ in the CSG is very close to ±1 and that clipping
the sequence to the nearest value of ±1 would give the equivalent attractor of
the CBG. More different cycles become stable with higher $\gamma$; however, the cycle
lengths are usually of order 2M - short cycles are apparently more likely to
become stable than ones whose length is of order $2^M$. A possible explanation for
this is probabilistic: as Sec. 2.3.5 will show, stability is possible only if the transfer
function is almost saturated, which only happens if the pattern and weight have
a high overlap. It is easier to find a short cycle for which this is fulfilled for every
pair of pattern and weight than a long cycle.

At amplifications $\gamma$ slightly below the lowest $\gamma$ for which the first cycle becomes
stable for a given M, intermittent behavior is observed: both $\mathbf{x}^t$ and
$\mathbf{w}^t$ stay near a cycle for an extended number of steps (typically several thousand
steps for M = 6) before returning to chaotic behavior for a similar time. An
example of this is given in Fig. 2.9.

Figure 2.9: Example of intermittent behavior for M = 6, $\gamma$ = 8. From top
to bottom: output $x^t$, norm of weights $w^t$, and largest "one-step Lyapunov
exponent" $\max(\ln|\Lambda|)$ (see Section 2.3.5).



2.3.5 Chaotic dynamics and Lyapunov exponents 

The term 'chaotic' was used loosely in Sec. 2.3.4 to describe the irregular
time series generated by the CSG. However, the system is in fact chaotic in
the strict sense, i.e., small deviations in the initial conditions grow exponentially.



Propagation of perturbations in one time step 

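The sensitivity of trajectories of the map given by Eqs. (2.25) and (2.26) to small
changes in the initial conditions can be tested by calculating the eigenvalues of
the Jacobi matrix

$$\mathbf{M}^t = \begin{pmatrix} \partial x_i^{t+1}/\partial x_j^t & \partial x_i^{t+1}/\partial w_j^t \\ \partial w_i^{t+1}/\partial x_j^t & \partial w_i^{t+1}/\partial w_j^t \end{pmatrix}. \tag{2.32}$$

This is to be understood as a 2M × 2M matrix with indices i and j running from
1 to M. It describes the evolution of infinitesimal perturbations $\Delta\mathbf{x} \oplus \Delta\mathbf{w}$ during
one time step.

The entries of this matrix are of the following form (with $h = \mathbf{x}^t\cdot\mathbf{w}^t$):

$$\begin{aligned}
\frac{\partial x_1^{t+1}}{\partial x_j^t} &= -\frac{2\beta}{\sqrt{\pi}}\, w_j^t \exp(-\beta^2 h^2); \\
\frac{\partial x_i^{t+1}}{\partial x_j^t} &= \delta_{i-1,j} \quad \text{for } i = 2, \dots, M; \\
\frac{\partial x_1^{t+1}}{\partial w_j^t} &= -\frac{2\beta}{\sqrt{\pi}}\, x_j^t \exp(-\beta^2 h^2); \\
\frac{\partial x_i^{t+1}}{\partial w_j^t} &= 0 \quad \text{for } i = 2, \dots, M; \\
\frac{\partial w_i^{t+1}}{\partial x_j^t} &= -\frac{\eta}{M}\, \mathrm{erf}(\beta h)\, \delta_{ij} - \frac{\eta}{M} \frac{2\beta}{\sqrt{\pi}}\, x_i^t w_j^t \exp(-\beta^2 h^2); \\
\frac{\partial w_i^{t+1}}{\partial w_j^t} &= \delta_{ij} - \frac{\eta}{M} \frac{2\beta}{\sqrt{\pi}}\, x_i^t x_j^t \exp(-\beta^2 h^2).
\end{aligned} \tag{2.33}$$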

If $|\beta h|$ is large and the transfer function is saturated, the exponential terms in Eq.
(2.33) are negligible. In that case, the upper left section of $\mathbf{M}^t$ is occupied only
on the first lower off-diagonal, and the lower right section is the M × M identity
matrix. Since the upper right section is identically 0, the lower left part does
not enter into the calculation of the eigenvalues either. This is not completely
obvious, but it follows from the definition of the determinant of a matrix as
the sum over permutations of the matrix elements [45]: if one of the factors of
a permutation is from the lower left part, at least one factor must be from the
upper right part, which is 0, and therefore that permutation gives no contribution
to the determinant.

This simplified matrix has M eigenvalues $\Lambda = 0$ and M eigenvalues $\Lambda = 1$.
The eigenvectors of the latter span the space of weight vectors, where small
changes to $\mathbf{w}^t$ are transferred unmodified to $\mathbf{w}^{t+1}$. The eigenvalues $\Lambda = 0$ all have
the same eigenvector, whose only non-vanishing component is $x_M$, the component
of the sequence vector that is rotated out at t + 1. This means that the eigenvectors
do not span the whole space and that thus the eigenvalues are not a reliable
measure of the propagation of a disturbance in the system.

If $|\beta h|$ is small enough for the exponential terms to have an appreciable effect,
the effect on the eigenvalues is not easy to calculate. By using values of x and w
taken from a run of the simulation and numerically calculating the eigenvalues,
one finds that typically one of the $\Lambda = 0$ eigenvalues is changed drastically and
may have an absolute value $|\Lambda| > 1$. This corresponds to a strong susceptibility
of the newly generated sequence component $x_1$ to small changes in w or x. The
other eigenvalues only undergo small corrections, corresponding to the feedback
of the new component to the weights.

During the regular phases of intermittent behavior, the largest eigenvalues of
the one-step matrix are significantly smaller than during the chaotic bursts (see
Fig. 2.9) - corresponding to sequence values that are close to x = ±1, and thus
a nearly saturated transfer function.

Long-time behavior 

To find the Lyapunov exponents of the map, which describe the exponential
growth or decay of small perturbations of the initial conditions (see e.g. Ref.
[46]), it is necessary to consider the development of a small perturbation over
a long time, i.e., to calculate the eigenvalues $\Lambda_i^T$ of $\prod_{t=1}^{T} \mathbf{M}^t$ (of course, the
trajectory is determined using the full nonlinear map). The Lyapunov exponents
are then defined as

$$\lambda_i = \lim_{T\to\infty} (1/T) \ln|\Lambda_i^T|. \tag{2.34}$$

The straightforward calculation of the product of Jacobi matrices brings many
numerical problems, which can be eliminated by applying a Gram-Schmidt
orthonormalization procedure to the columns of the product matrix at regular
intervals, as described in Ref. [47]. With this procedure, it is possible to average
over T > 100M and get numerically stable results. The largest Lyapunov exponent
is displayed in Fig. 2.10. Typically, there are M/2 positive exponents.

Attractor dimension 

The Kaplan-Yorke conjecture [48] states that there is a connection between the
dimension D of an attractor of a map and the spectrum of Lyapunov exponents,
which are here assumed to be ordered ($\lambda_1 > \lambda_2 > \dots > \lambda_{2M}$):

$$D_{KY} = k + \sum_{i=1}^{k} \lambda_i / |\lambda_{k+1}|, \tag{2.35}$$

where k is the value for which $\sum_{i=1}^{k} \lambda_i > 0$ and $\sum_{i=1}^{k+1} \lambda_i < 0$. Applying this to the
spectrum of exponents derived from (2.34) gives an average attractor dimension
between 1.1M and 1.2M, slightly depending on $\gamma$.
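The Kaplan-Yorke formula is easy to evaluate once a spectrum of exponents is available. A minimal sketch (the function name is illustrative, and the example spectrum is made up for demonstration, not taken from the simulations):

```python
def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke dimension D = k + sum_{i<=k} lambda_i / |lambda_{k+1}|."""
    lam = sorted(exponents, reverse=True)       # order the spectrum, largest first
    total = 0.0
    for k, value in enumerate(lam):
        if total + value < 0.0:                 # adding lambda_{k+1} makes the sum negative
            return k + total / abs(value)
        total += value
    return float(len(lam))                      # the partial sums never become negative

# made-up spectrum: one expanding, one neutral and one contracting direction
print(kaplan_yorke_dimension([0.5, 0.0, -1.0]))
```

For the made-up spectrum (0.5, 0, -1) the first two exponents sum to 0.5, so k = 2 and the dimension is 2 + 0.5/1 = 2.5.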

The attractor dimension is a measure for the effective number of degrees of 
freedom of a chaotic system. It cannot be larger than the number of dynamic 
variables of the system (which, in this case, is 2M). A much smaller number is 
possible, as seen in the case of the SGen (Sec. 2.3.1), where the quasi-periodic 
attractor is one-dimensional. In that case, all components of the sequence vector 
are strongly correlated. The fact that the Kaplan-Yorke dimension of the CSG is
larger than M indicates that its pattern components are more or less independent.

Figure 2.10: Lyapunov exponents measured from the time development of perturbations
and from propagating the Jacobi matrix (Eq. (2.34)), for $\gamma$ = 8 and
$\gamma$ = 12.

Lyapunov exponents for large M

An alternative method for measuring the largest Lyapunov exponent is to start
two trajectories with infinitesimally different initial conditions, and propagate
both of them using the nonlinear map. At regular intervals, one measures the distance
between the trajectories, stores it, and resets the distance to the initial value while
keeping the direction of the distance vector.

Since the largest Lyapunov exponent dominates the growth of the perturbation,
the contribution of all other Lyapunov exponents can be neglected for
sufficiently long times. The advantage of this method is that it requires only
O(M) calculations per time step, rather than a higher power of M as in the
previous method, which makes it possible to go to much higher M.
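The two-trajectory method can be sketched as follows. This is an illustration under several assumptions that are not taken from the thesis: the perturbation is renormalized after every single step, the learning rate is set to 1 so that the amplification equals $\gamma$, and the run starts from random ±1 patterns with a weight norm of $\sqrt{\pi/8}$. All names are hypothetical.

```python
import math
import numpy as np

def csg_step(x, w, gamma):
    """CSG map with eta = 1 and beta = gamma (assumed parametrization)."""
    s = -math.erf(gamma * float(np.dot(x, w)))
    w_new = w + s * x / len(x)
    x_new = np.roll(x, 1)
    x_new[0] = s
    return x_new, w_new

def largest_lyapunov(M, gamma, steps=20000, d0=1e-8, seed=0):
    """Estimate the largest Lyapunov exponent from two nearby trajectories."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=M)
    w = rng.normal(size=M)
    w *= math.sqrt(math.pi / 8.0) / np.linalg.norm(w)   # start near the stationary norm
    delta = rng.normal(size=2 * M)
    delta *= d0 / np.linalg.norm(delta)                 # perturbation of length d0
    x2, w2 = x + delta[:M], w + delta[M:]
    log_growth, used = 0.0, 0
    for _ in range(steps):
        x, w = csg_step(x, w, gamma)
        x2, w2 = csg_step(x2, w2, gamma)
        diff = np.concatenate([x2 - x, w2 - w])
        d = float(np.linalg.norm(diff))
        if d == 0.0:                                    # trajectories merged numerically
            break
        log_growth += math.log(d / d0)                  # store the growth of the distance
        used += 1
        diff *= d0 / d                                  # reset the length, keep the direction
        x2, w2 = x + diff[:M], w + diff[M:]
    return log_growth / max(used, 1)

print(largest_lyapunov(M=50, gamma=8.0, steps=5000))
```

Each step costs two dot products and a few vector operations, i.e. O(M) work, which is what makes this approach attractive for large M.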

The results for $\lambda_1$ are also displayed in Fig. 2.10: the values gained by the
two methods agree well within the numerical errors. For large M, $\lambda_{max}$ decreases
with 1/M, i.e., perturbations grow on the $\alpha$-timescale of online learning.

This is slightly astonishing, since the matrix in Eq. (2.33) has entries that
are proportional to 1/M, but also entries that are of order 1. In the limit of
large M, one would expect that it is these entries that are responsible for the
growth of deviations. In that simplifying picture, the CSG is similar to the SGen
with fixed weights, but is kept in the chaotic transient by the slow movement of
weights. However, even in that picture, it takes M time steps for a perturbation
to spread throughout all the variables: the matrix entries of order 1 affect the
newly generated sequence component, which is then rotated through the
M-component pattern vector step by step. This would explain the observed scaling of
the Lyapunov exponents.

2.3.6 Summary of the CSG

Just like for the Confused Bit Generator, the step from fixed to variable weights
qualitatively changes the behavior of the system: while the SGen usually generates
a quasi-periodic sequence, the CSG is a rather complicated nonlinear mapping
that shows many of the trademarks of chaotic systems: high-dimensional
chaos, intermittency, and the emergence of stable attractors upon variation of the
control parameter $\gamma$. The existence of an attractive absorbing state makes the study
of chaotic behavior difficult for small M. For large M, chaos is stable enough
for solid numerical measurements of Lyapunov exponents and other statistical
quantities.

The mean square of the norms and outputs can be calculated in a mean-field-like
approach, which correctly predicts the saddle-node bifurcation at which a
fairly stable chaotic regime becomes possible. Quantitative differences between
the calculation and simulations have their origin in temporal correlations between
patterns and outputs.

2.4 Decision Tables 

2.4.1 Decision tables and DeBruijn graphs 

If binary histories are considered, the most general prediction algorithm consists
of a Boolean function that has an individual prediction for each possible history.
The downside of this generality is that no generalization is possible: since there
is no notion of similarity between two histories, one cannot deduce correlations
along the lines of "similar histories usually lead to similar consequences". To
make this clearer, and for convenient notation, histories will be denoted by a
single number $\mu$, written out as a binary string such as 110010, instead of a
vector x = (1, 1, −1, −1, 1, −1).

There are two convenient ways to visualize a Boolean function: one is a lookup
table which contains, in each row, a history $\mu$ and a prediction $a_\mu$, as in the
example given in Fig. 2.11.

The other representation of the Boolean function takes a graph-theoretical
perspective, using directed DeBruijn graphs of order M (see Fig. 2.12 for an
example of such a graph). These graphs have one node for each possible pattern
$\mu$. Furthermore, each node has two edges entering it, coming from the two
possible predecessors, which I denote $^0\mu$ and $^1\mu$. For example, if $\mu$ = 1100, the
possible predecessors are $^0\mu$ = 0110 and $^1\mu$ = 1110. Each node also has two
outgoing edges, leading to the two possible successors $\mu^0$ and $\mu^1$ (for $\mu$ = 1100,
the successors are $\mu^0$ = 1000 and $\mu^1$ = 1001). The graph is connected, since one
can reach each node from any other node in a maximum of M steps by taking
the appropriate exit edges.
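The neighborhood structure of the DeBruijn graph follows from simple bit operations on the history. A small sketch, with illustrative function names and histories stored as integers:

```python
def successors(mu, M):
    """The two histories reachable from mu: shift left, drop the oldest bit, append 0 or 1."""
    base = (2 * mu) % (1 << M)
    return base, base + 1

def predecessors(mu, M):
    """The two histories that lead to mu: shift right and prepend 0 or 1 as the oldest bit."""
    low = mu >> 1
    return low, low + (1 << (M - 1))

def as_bits(mu, M):
    """Write a history as a binary string of length M."""
    return format(mu, "0{}b".format(M))

mu = 0b1100
print([as_bits(v, 4) for v in predecessors(mu, 4)])   # ['0110', '1110']
print([as_bits(v, 4) for v in successors(mu, 4)])     # ['1000', '1001']
```

The output reproduces the example given in the text for the history 1100.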



   μ     a_μ
  000     0
  001     0
  010     1
  011     1
  100     0
  101     1
  110     1
  111     0

Figure 2.11: A Boolean function for M = 3, represented as a decision table.

Figure 2.12: The Boolean function from Fig. 2.11 represented as a directed
DeBruijn graph of order 3: Nodes represent binary strings of length 3, edges
lead to strings that are generated by shifting the current string one position and
adding either 0 or 1 as the new least significant bit. Solid lines denote the active
exit.

A DeBruijn graph can be modified to represent a Boolean function by labeling
only the exit corresponding to $a_\mu$ of each node $\mu$ "active". The sequence can then
be easily found by going from node to node following the active exits.

2.4.2 Static tables 

Each cyclic sequence in which any M-bit pattern appears at most once can be
predicted by a suitably prepared M-bit Boolean function. At most, these sequences
have length $2^M$: namely, if all patterns appear exactly once. These sequences are
Hamiltonian circuits or full cycles on the DeBruijn graph of order M. These cycles
have been studied extensively; for a review, see Ref. [49]. One of the central
(and oldest) results [50] is the number of different Hamiltonian cycles, which is

$$2^{2^{M-1} - M}.$$

When the system is initialized randomly, there is only a slim chance that
the cycle indeed has full length: there are $2^{2^M}$ possible initial states of the
decision table, of which only $2^{2^{M-1} - M}$ correspond to a full cycle - a fraction of
$2^{-2^{M-1} - M}$. Most cycles are significantly shorter. The following crude estimate,
which ignores much of the structure of the DeBruijn graph, will nevertheless give
an approximation of average cycle lengths.

Consider a random graph with the following properties: each of the N nodes
has two potential incoming edges and two outgoing edges, of which only one
is active. Which node is connected to which is random. You start on
one node and follow the active exits until you come back to a node that you
have visited before (i.e., you enter a cycle). Given that you have traveled $\tau$ time
steps without entering a cycle, the chance of hitting a cycle in the next step
is $p_h(\tau) = \tau/(2N)$ - there are 2N possible entrances, but each of the previously
visited nodes is weighted only with one entrance, because the other one was
already "spent" - it is connected to the node you visited one step earlier.

The probability of entering the cycle for the first time after t steps is

$$P_h(t) = \prod_{\tau=1}^{t-1} (1 - p_h(\tau))\, p_h(t). \tag{2.36}$$

Since at that time there are t nodes that can be revisited (or rather t − 1, but
this is just an approximation anyway), the cycle that is hit after t steps has an
average length l = t/2. A further approximation deals with the product in Eq.
(2.36):

$$\prod_{\tau}^{t} [1 - \tau/(2N)] \approx [1 - t/(4N)]^t \approx \exp[-t^2/(4N)]. \tag{2.37}$$

The average cycle length is then the sum over all transient times, weighted with
the right probability. The sum can be approximated by an integral:

$$\langle l \rangle = \sum_{t=1}^{\infty} \frac{t}{2} P_h(t) \approx \int_0^{\infty} dt\, \exp\left( -\frac{t^2}{4N} \right) \frac{t^2}{4N} = \frac{\sqrt{\pi N}}{2}. \tag{2.38}$$

Using $N = 2^M$ from the DeBruijn graph, the average cycle length scales with
$2^{M/2}$. Simulations give a remarkably good fit to the value calculated in Eq.
(2.38), as seen in Fig. 2.13.

Figure 2.13: Average cycle length of a static Boolean table, starting from random
initial conditions: averages over at least 10^ runs are compared to Eq. (2.38).
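A comparison of this kind can be set up with a short simulation. The sketch below is not the code behind Fig. 2.13; it assumes the estimate $\langle l \rangle \approx \sqrt{\pi N}/2$ with $N = 2^M$, and the function names, the number of runs and the random number generator are arbitrary choices.

```python
import math
import random

def cycle_length(table, start, M):
    """Follow the active exits of a static decision table until a history repeats."""
    seen = {}
    mu, t = start, 0
    while mu not in seen:
        seen[mu] = t                              # remember when each history was visited
        mu = (2 * mu + table[mu]) % (1 << M)      # shift in the predicted bit
        t += 1
    return t - seen[mu]                           # steps between the two visits

def mean_cycle_length(M, runs=2000, seed=0):
    """Average cycle length over random tables and random initial histories."""
    rng = random.Random(seed)
    n = 1 << M
    total = 0
    for _ in range(runs):
        table = [rng.randint(0, 1) for _ in range(n)]
        total += cycle_length(table, rng.randrange(n), M)
    return total / runs

def estimate(M):
    """Crude random-graph estimate sqrt(pi * N) / 2 with N = 2^M."""
    return 0.5 * math.sqrt(math.pi * 2 ** M)

for M in (6, 8, 10):
    print(M, mean_cycle_length(M), estimate(M))
```

Storing the visiting time of each history in a dictionary gives the cycle length in a single pass, at the cost of memory proportional to the length of the transient plus the cycle.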

2.4.3 Dynamic tables: complete antipersistence 

As pointed out in the introduction, a prediction algorithm needs a learning rule in 
order to adapt to a given problem. The simplest learning rule for decision tables 
is to set the table entry corresponding to the current history to the observed value 
of the time series. 

In order to generate the time series that is completely antipredictable for this
prediction algorithm, the entries in the decision table have to be changed at every
time step.* To be precise, the algorithm is as follows: At each time step,

• a new bit is generated by taking the table entry corresponding to the current
history $\mu_t$: $s_{t+1} = a^{\mu_t}_t$;

• the history is updated: $\mu_{t+1} = (2\mu_t + s_{t+1}) \bmod 2^M$; i.e., all bits
are shifted one position to the left (multiplication by 2), the newly generated
bit is added, and the oldest (most significant) bit is dropped (reduction modulo
$2^M$);

• the table entry $a^{\mu_t}$ that was used for making the decision is changed, such
that the sequence will be continued with the opposite decision when the
pattern $\mu_t$ occurs the next time: $a^{\mu_t}_{t+1} = 1 - a^{\mu_t}_t$. All other
entries remain unchanged.

* The feature that two consecutive appearances of a pattern are likely followed by
different outputs was called "antipersistence", e.g. in Ref. [51]. Usually this term
is applied to continuous time series with a Hurst exponent < 0.5 [52]; however, I
will use it for binary time series, in the sense mentioned above.

Figure 2.14: An example of a decision table with M = 2 and two steps of the
dynamics. Boldface numbers indicate the current history and the table entry
used for continuing the sequence; italic numbers denote the last table entry that
was changed. The sequence generated in this example is 010
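The three update rules can be written down directly. This is a minimal sketch, assuming the history is stored as an integer $\mu \in [0, 2^M)$ and the table as a list of bits; the names `antipersistent_step` and `generate` are illustrative.

```python
def antipersistent_step(mu, table, M):
    """One time step: emit the table entry, shift the history, flip the entry."""
    s = table[mu]                                   # new bit s_{t+1}
    table[mu] = 1 - s                               # opposite decision next time
    new_mu = ((mu << 1) | s) & ((1 << M) - 1)       # (2*mu + s) mod 2**M
    return new_mu, s


def generate(mu, table, M, steps):
    """Run the dynamics on a copy of the table and collect the emitted bits."""
    table = list(table)
    bits = []
    for _ in range(steps):
        mu, s = antipersistent_step(mu, table, M)
        bits.append(s)
    return bits, mu, table


if __name__ == "__main__":
    print(generate(0, [0, 0, 0, 0], 2, 3))
```

The initial condition in the usage line (history 00, table of zeros) is an arbitrary choice for illustration and is not claimed to be the one shown in Fig. 2.14.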

It should be noted that this definition of the dynamics is different from the definition
as it was used for the CBG: the output of the generator is used for continuing 
the sequence, rather than the inverse of the output. However, the same time 
series could be generated by taking always the inverse of the output, but using 
the inverse of the decision table for initial conditions. I chose the convention 
presented above to avoid the paradox that one would travel through the graph 
following the inactive exits. 

As in the case of the confused perceptron, the internal parameters of the 
prediction algorithm (in this case, the table entries) become dynamical variables. 
As an example, three steps of the combined dynamics of sequence and table for 
M = 2 are shown in Fig. 2.14. 

In terms of DeBruijn graphs, history moves from one node to the next along 
the currently active exit. After the move, the exit of the node that was just left 
is switched, as shown in Fig. 2.15. 
Figure 2.15: Example for a step of the dynamics on a graph of order 2. Active
exits are denoted by solid lines, inactive ones by dashed lines. The bold circle
indicates the node currently visited. Both the upper left configuration (which
happens to be part of a cycle, and corresponds to the example in Fig. 2.14) and
the lower one (which is part of a transient) lead to the configuration on the right,
which shows that the dynamics is irreversible (see Sec. 2.4.4).

2.4.4 Properties of cycles 

The introduced dynamics is deterministic, and the combined system of pattern
and table has a finite number $Q = 2^M \cdot 2^{(2^M)}$ of different states, so the
dynamics necessarily leads into a cycle eventually. The dynamics is irreversible:
if a currently visited node has two inactive entrances, it is impossible to tell
which path the system took to get to its current state (for an example, see
Fig. 2.15). This means that not every state can be part of a cycle, so we will
have to consider the necessary conditions for being in a cycle. I will show, step
by step, that all cycles are of length $2 \cdot 2^M$ and touch all nodes exactly twice.

Some of the proofs that now follow are redundant; on the other hand, they 
help to understand the properties of the system, and some of them are applicable 
to generalizations of the problem, whereas others are not. 

Let us assume that at time 0, the system is already moving on a cycle of
length $l$. We count the number of times that a history $\mu$ has occurred between
time 0 and time $t$ by a visit number $v^\mu_t$. Since the definition of a cycle is
that after $l$ steps the system must be in the same state again, it is necessary
that $v^\mu_l$ is even for all $\mu$, since the table entries return to their original
state only after even numbers of visits.

Also, all possible nodes are part of the cycle. Let us prove this by assuming
the opposite, namely that there are some nodes that are not touched by the
cycle. Since the graph is connected, there must be unused connections between
the part of the graph involved in the cycle and the part that is left out. But as
we have seen in the paragraph before, the visit number of each of the nodes that
are actually part of the cycle must be at least 2 (larger than 0, and even), so
each of its two exits is used, including the one leading to the part of the graph
supposedly not included in the cycle. This is a contradiction, so all nodes are
involved.

An even stronger statement is possible: the total number of visits to the
predecessors ${}^0\mu$ and ${}^1\mu$ of $\mu$ must be equal to twice the number of
visits to $\mu$, since exactly half of the visits they get are followed by $\mu$,
while the other half is followed by the so-called conjugate state $\bar{\mu}$ of
$\mu$. (For example, 001 is the conjugate state of 000.) Thus, we have

$$v^{{}^0\mu} + v^{{}^1\mu} = 2\, v^{\mu} \quad \text{for all } \mu. \qquad (2.39)$$



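This can be written as a linear equation for an eigenvector with eigenvalue 1 of
a matrix $\mathbf{M}$, with entries $a_{\mu\nu} = 1/2$ if $\mu$ is a possible successor
of $\nu$ and $a_{\mu\nu} = 0$ otherwise. For example, for $M = 2$, the set of Eqs.
(2.39) looks as follows:

$$\frac{1}{2} \begin{pmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} v^{00} \\ v^{01} \\ v^{10} \\ v^{11} \end{pmatrix} = \begin{pmatrix} v^{00} \\ v^{01} \\ v^{10} \\ v^{11} \end{pmatrix}. \qquad (2.40)$$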



Since the sum of columns in matrix $\mathbf{M}$ is always 1, and the individual
entries are $\geq 0$, and it describes transitions on a connected graph, we can apply
results from the theory of stochastic matrices to state that it has one unique
eigenvector with eigenvalue 1 [53], and we easily guess that $v^\mu = \text{const}$
fulfills Eq. (2.39). That means that in a cycle, all states are visited with the
same frequency.
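The structure of the matrix in Eq. (2.40) can be checked for other values of $M$. The sketch below builds the matrix from the successor rule $\nu \to (2\nu + b) \bmod 2^M$ and verifies that the constant vector is left unchanged; the helper names are illustrative, and exact rational arithmetic is used only to keep the comparison free of rounding.

```python
from fractions import Fraction


def transition_matrix(M):
    """Entry [mu][nu] is 1/2 if mu is a possible successor of nu, else 0."""
    size = 1 << M
    half = Fraction(1, 2)
    mat = [[Fraction(0)] * size for _ in range(size)]
    for nu in range(size):
        for bit in (0, 1):
            mu = ((nu << 1) | bit) & (size - 1)
            mat[mu][nu] += half
    return mat


def apply(mat, vec):
    """Matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


if __name__ == "__main__":
    mat = transition_matrix(3)
    uniform = [Fraction(1)] * 8
    print(apply(mat, uniform) == uniform)  # constant vector has eigenvalue 1
```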

The next step is to show that in a cycle, each node is visited exactly twice,
i.e., all cycles of length $4 \cdot 2^M$, $6 \cdot 2^M$, and so on, are in fact two,
three or more repetitions of a $2 \cdot 2^M$-cycle. Again, assume the system is
moving on a cycle. If this cycle were truly longer than $2 \cdot 2^M$, there must,
at the point $t = 2 \cdot 2^M$, be nodes that have been visited three or more times
while others have not been visited for the second time: the visit numbers must
add up to $2 \cdot 2^M$, and if all visit numbers were equal to 2, the cycle would be
complete. More specifically, there must be an earlier time when all visit numbers
are either 0, 1, or 2, and one node is about to be visited for the third time. It
suffices to show that this cannot happen to prove that the cycle cannot be longer
than $2 \cdot 2^M$.

The third visit to a node $\mu$ with $v^\mu = 2$ cannot come from a predecessor
(let us say, ${}^0\mu$) with a visit number of $v^{{}^0\mu} = 0$, for the obvious
reason that this predecessor has not been visited yet. It also cannot come from
a predecessor with $v^{{}^0\mu} = 1$: if $v^\mu = 2$, either it must have been visited
before from ${}^0\mu$ (which it cannot - the predecessor has only had one visit so
far), or it must have been reached twice from ${}^1\mu$ - this is impossible as
well, since it means that $v^{{}^1\mu} \geq 3$. For similar reasons, we can exclude a
visit from a node with $v^{{}^0\mu} = 2$: either $v^{{}^1\mu} \geq 3$ as before, or the
first visit to ${}^0\mu$ led to $\mu$ - then the second cannot. This means that all
nodes must receive two visits - thus finishing a cycle - before one of them can
be visited for the third time. The question arises why this line of reasoning
does not hold true during the transient. The key lies in the observation that the
first node to receive three visits is the node where the system was started - it
did not get its first visit from anywhere on the graph, so the arguments are not
applicable.

Using all previous conclusions, the cycles turn out to be of the kind mentioned
before: since every combination of $M$-bit pattern and following bit, i.e., every
$(M+1)$-bit pattern, occurs exactly once, the cycles are Hamiltonian paths, or full
cycles, on the $(M+1)$-graph. Again using results from Ref. [50], we know that
there are $2^{2^M-(M+1)}$ distinct Hamiltonian paths.

All possible full cycles on the M + 1-graph can be generated by the antiper- 
sistent walk on the M-graph: write down the desired sequence starting at some 
arbitrary point, look for the first occurrence of each M-bit pattern /i, and set 
the corresponding table entry to the bit that follows it. Starting the antipersis- 
tent walk at the first pattern of the desired sequence, the antipersistent walk will 
reproduce it. 
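The construction described in this paragraph can be sketched as follows. The code assumes the desired full cycle is given as a cyclic list of bits of length $2^{M+1}$; `table_for_cycle` and `replay` are hypothetical helper names, and the example sequence 00010111 is one full cycle on the graph of order 3, chosen for illustration.

```python
def window(seq, i, M):
    """Integer value of the M bits of the cyclic sequence starting at index i."""
    n = len(seq)
    mu = 0
    for k in range(M):
        mu = (mu << 1) | seq[(i + k) % n]
    return mu


def table_for_cycle(seq, M):
    """Set each table entry to the bit following the first occurrence of its pattern."""
    n = len(seq)
    table = {}
    for i in range(n):
        table.setdefault(window(seq, i, M), seq[(i + M) % n])
    start = window(seq, 0, M)  # first pattern of the desired sequence
    return start, [table[mu] for mu in range(1 << M)]


def replay(start, table, M, steps):
    """Antipersistent walk started from the constructed table."""
    table = list(table)
    mu, out = start, []
    for _ in range(steps):
        s = table[mu]
        table[mu] = 1 - s
        mu = ((mu << 1) | s) & ((1 << M) - 1)
        out.append(s)
    return out


if __name__ == "__main__":
    seq = [0, 0, 0, 1, 0, 1, 1, 1]
    start, table = table_for_cycle(seq, 2)
    print(start, table, replay(start, table, 2, 8))
```

The walk emits the bits that follow the initial $M$-bit pattern, so the output is the desired sequence rotated by $M$ positions.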

Since all cycles are of length $2 \cdot 2^M$, a total of
$2 \cdot 2^M \times 2^{2^M-(M+1)} = 2^{(2^M)}$ states is part of a cycle. As mentioned
before, the total number of possible states is $Q = 2^M \cdot 2^{(2^M)}$, which means
that a fraction of $2^{(2^M)}/Q = 2^{-M}$ of possible states is part of a cycle. It is
thus interesting to check how long it takes for the system to reach a cycle, i.e.,
to study the distribution of transient lengths $\tau$. This distribution is not
easily accessible to analytical approaches, but easy to measure in computer
simulations, either by complete enumeration for small systems or by Monte Carlo
simulations for larger ones. The following picture emerges, as shown in Fig. 2.16:

The probability for transient length $\tau = 0$ is just the probability of hitting
a cycle right away, and thus the fraction of state space filled with cycles. As
mentioned, this is equal to $2^{-M}$.

The probability distribution is more or less flat for $1 < \tau < 2^{M+1}$, the cycle
length. From normalization constraints, it follows that $p(\tau) \approx 2^{-(M+1)}$ in
that range.

Near $\tau = 2^{M+1}$, there is an exponential drop reminiscent of a phase transition,
which gets steeper with increasing $M$. Even for small $M$, no transients longer
than $2^{M+2}$ have been observed.
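For small $M$, the complete enumeration mentioned above takes only a few lines. The sketch below assumes a state is the pair (history, table) and iterates the antipersistent dynamics until a state repeats; the function names are illustrative.

```python
from itertools import product


def step(state, M):
    """One step of the combined dynamics of history and table."""
    mu, table = state
    s = table[mu]
    new_table = table[:mu] + (1 - s,) + table[mu + 1:]
    return (((mu << 1) | s) & ((1 << M) - 1), new_table)


def transient_and_cycle(state, M):
    """Return (transient length, cycle length) for one initial state."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, M)
        t += 1
    return seen[state], t - seen[state]


def enumerate_states(M):
    """Transient and cycle lengths for all 2**M * 2**(2**M) initial states."""
    size = 1 << M
    results = []
    for table in product((0, 1), repeat=size):
        for mu in range(size):
            results.append(transient_and_cycle((mu, table), M))
    return results


if __name__ == "__main__":
    results = enumerate_states(2)
    on_cycle = sum(1 for tau, _ in results if tau == 0)
    print(len(results), on_cycle, {length for _, length in results})
```

For $M = 2$ this reports a single cycle length $2 \cdot 2^M = 8$ and a fraction $2^{-M}$ of states with $\tau = 0$, in line with the counting argument above.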

Antipersistence on different timescales 

Obviously, a sequence that is antipredictable for one algorithm can be perfectly
predictable for another. Is it necessary for that other algorithm to be completely
different in its structure, or is it maybe enough to adjust some parameters? To
answer this question for this particular algorithm, it suffices to let an observer
predict the antipersistent time series using a decision table, and vary the memory
length of the observer. For simplicity's sake, we consider the long-time limit, in
which both the generator and the observer move on a cycle.

Figure 2.16: Distribution of transient lengths $\tau$, rescaled by the cycle length
$2^{M+1}$.

If the observer looks at the same time window as the generator ($M_{obs} = M$),
it is obvious that the success rate will be 0, since each pattern is continued with
alternating bits on each visit. For an observer with a slightly larger window,
the picture changes: as mentioned above, the antipersistent cycle corresponds
to a Hamiltonian cycle on the $(M+1)$-graph, which is completely persistent and
predictable with 100% accuracy. For even larger $M_{obs}$, the antipersistent cycle
looks like a closed path which includes only a fraction of $2^{M+1-M_{obs}}$ of nodes
on the $M_{obs}$-graph. Prediction is again 100% reliable, and the observer does not
even need all of his storage capacity to handle the cases that occur.
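These two limiting cases are easy to reproduce numerically. The following is a minimal sketch, assuming the observer simply predicts the bit that followed the most recent occurrence of the current $M_{obs}$-bit pattern; the names `antipersistent_bits` and `observer_success_rate` and the warm-up length are illustrative choices.

```python
def antipersistent_bits(M, steps, mu=0, table=None):
    """Bits generated by the completely antipersistent walk."""
    size = 1 << M
    table = [0] * size if table is None else list(table)
    bits = []
    for _ in range(steps):
        s = table[mu]
        table[mu] = 1 - s
        mu = ((mu << 1) | s) & (size - 1)
        bits.append(s)
    return bits


def observer_success_rate(bits, m_obs, warmup):
    """Fraction of correct predictions after `warmup` steps."""
    mask = (1 << m_obs) - 1
    last_seen = {}   # pattern -> bit that followed it most recently
    pattern = 0
    hits = trials = 0
    for t, bit in enumerate(bits):
        if t >= m_obs:  # the pattern now consists of genuine bits only
            if t >= warmup and pattern in last_seen:
                trials += 1
                hits += last_seen[pattern] == bit
            last_seen[pattern] = bit
        pattern = ((pattern << 1) | bit) & mask
    return hits / trials


if __name__ == "__main__":
    bits = antipersistent_bits(8, 20000)
    for m_obs in (6, 7, 8, 9, 10):
        print(m_obs, observer_success_rate(bits, m_obs, 4096))
```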

If the observer has a shorter time window than the generator, more than one
of the generator's patterns will affect the same table entry for the observer. For
example, an $(M-1)$-bit pattern $\nu$ corresponds to either of the $M$-bit strings
$0\nu$ or $1\nu$, both of which occur twice in the $M$-cycle, each time followed by a
different successor. The success rate of the predictor depends on the sequence in
which these combinations occur; if each permutation of $0\nu0$, $0\nu1$, $1\nu0$ and
$1\nu1$ has the same probability, the success rate for all patterns is the average
over the different permutations. Fig. 2.17 shows that this average is 1/3 for
$M_{obs} = M - 1$.

For $M_{obs} = M - 2$, all permutations of eight combinations of predecessors
and successors have to be taken into account - a task best left to computer
algebra programs, which yield $\langle s(M, M-2) \rangle = 3/7$, in excellent agreement
with simulations (see Fig. 2.18). Larger differences in the time window are beyond
even the scope of computer programs; however, it can be argued that for larger
$M - M_{obs}$, the visits to the $M_{obs}$-nodes become more and more random, and
$s(M, M_{obs})$ will tend to 1/2.

Figure 2.17: Possible permutations of predecessors and successors of a pattern $\nu$
on a cycle. The error rate, given in the last column, is the rate of flips between
0 and 1 in the sequence of successors.

To answer the question raised above, it is thus not always necessary to go to 
structurally completely different algorithms or updating methods to turn a neg- 
ative overlap into a positive one. However, as the next section shows, increasing 
the amount of information that is processed does not always improve prediction 
accuracy. 

2.4.5 Stochastic antipersistence 

All of the observations from the last sections relied on the fact that the generator
is on a cycle with well-known properties. It is thus interesting to ask how stable
these results are if the sequence is not completely antipersistent. The simplest
generalization is to introduce a probability $p_{AP}$ for changing the table entry/exit
when visiting a node: $p_{AP} = 1$ reproduces the completely antipersistent walk;
$p_{AP} = 0$ is equivalent to using a constant (quenched) decision table, and
$p_{AP} = 1/2$ generates a completely random time series.
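The three limiting cases can be checked with a small simulation. This is a sketch under the assumption that the table and the initial history are drawn at random and that the observer predicts the bit that last followed the current pattern; all names are illustrative.

```python
import random


def stochastic_bits(M, p_ap, steps, rng):
    """Walk in which the used table entry is flipped with probability p_ap."""
    size = 1 << M
    table = [rng.randrange(2) for _ in range(size)]
    mu = rng.randrange(size)
    bits = []
    for _ in range(steps):
        s = table[mu]
        if rng.random() < p_ap:
            table[mu] = 1 - s
        mu = ((mu << 1) | s) & (size - 1)
        bits.append(s)
    return bits


def success_rate(bits, m_obs, warmup):
    """Success rate of an observer with memory m_obs after a warm-up phase."""
    mask = (1 << m_obs) - 1
    last_seen = {}
    pattern = 0
    hits = trials = 0
    for t, bit in enumerate(bits):
        if t >= m_obs:
            if t >= warmup and pattern in last_seen:
                trials += 1
                hits += last_seen[pattern] == bit
            last_seen[pattern] = bit
        pattern = ((pattern << 1) | bit) & mask
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for p_ap in (0.0, 0.25, 0.5, 0.75, 1.0):
        bits = stochastic_bits(8, p_ap, 50000, rng)
        print(p_ap, success_rate(bits, 8, 5000))
```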

A first intuitive guess would be that even a small deviation from deterministic
dynamics completely destroys all predictability: after all, on a path of length
$2^{M+1}$, there are on average $(1 - p_{AP})\,2^{M+1}$ occasions where the sequence
is continued persistently, thus leaving the cycle. Indeed, a single "error" is usually
enough to move the system from one cycle to another; however, much of the local
structure remains untouched. It turns out that the functions $s(M, M_{obs}, p_{AP})$ of
prediction rates converge for large $M$ (meaning roughly $M > 12$) to a set of
curves that depend only on $p_{AP}$ and $M - M_{obs}$, which is displayed in Fig. 2.18.

The limit values for $p_{AP} = 1$ have been explained above, and they are
approached continuously for $p_{AP} \to 1$. For $p_{AP} = 0.5$, the curves intersect at
$s = 0.5$: no prediction beyond guessing is possible. For small $p_{AP}$, all curves
converge to 1: the system is dominated by short loops in which only a small fraction
of the possible states participate, and those are predicted with high accuracy.

Interestingly, between $p_{AP} = 0.5$ and roughly $p_{AP} = 0.85$, all shown curves are
below 0.5, meaning that even observers with longer memory predict the sequence
with less than 50% accuracy. I will give an analytical argument why this is the
case for $M_{obs} = M + 1$. An $(M+1)$-bit pattern $\nu$ is a combination of an $M$-bit
pattern $\mu$ and one of its predecessors, let us say ${}^0\mu$, whereas the companion
state $\bar{\nu}$ is a combination of $\mu$ and the other predecessor ${}^1\mu$. A visit to
either $\nu$ or $\bar{\nu}$ switches the exit of $\mu$ with probability $p_{AP}$. Consider two
subsequent visits to $\nu$, with some number $l$ of visits to $\bar{\nu}$ between them. The
probability $s(p_{AP}, M, M+1)$ of continuing with the same bit after these two visits
is a sum of two probabilities: either the exit of $\mu$ was switched upon leaving
$\nu$ the first time and then switched an odd number of times during the $l$ visits
to $\bar{\nu}$, or it was not switched the first time and switched an even number of
times in between. Given $p_{AP}$ and the probability $\pi_l(p_{AP}, M)$ of having $l$
intermediate visits to $\bar{\nu}$, one then obtains by basic combinatorics

$$s(p_{AP}, M, M+1) = \sum_{l=0}^{\infty} \frac{1}{2}\, \pi_l(p_{AP}, M) \left[1 + (1 - 2p_{AP})^{l+1}\right]. \qquad (2.41)$$

Unfortunately, $\pi_l(p_{AP}, M)$ does not seem to be analytically accessible for
general $p_{AP}$. It can be measured in simulations, and the accuracy of Eq. (2.41)
can be verified (see Fig. 2.18); also, for $p_{AP} = 1/2$, since the system does a
completely random walk on the graph, one gets the simple distribution
$\pi_l(1/2, M) = 2^{-(l+1)}$. Assuming that this distribution does not change
discontinuously near $p_{AP} = 1/2$, Eq. (2.41) yields the approximation
$s(1/2 + \delta p, M, M+1) \approx 1/(2 + 2\delta p) \approx (1/2)(1 - \delta p)$. This is obviously
$< 1/2$ for $\delta p > 0$, i.e., $p_{AP} > 1/2$.
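The step from Eq. (2.41) to the closed form $1/(2 + 2\delta p)$ is a geometric series and can be verified numerically. The sketch below evaluates the sum of Eq. (2.41) for a user-supplied distribution `pi`; the truncation `l_max` and the function names are assumptions made for illustration.

```python
def success_rate_eq_2_41(p_ap, pi, l_max=200):
    """Truncated sum of Eq. (2.41) for a given distribution pi(l)."""
    return 0.5 * sum(
        pi(l) * (1 + (1 - 2 * p_ap) ** (l + 1)) for l in range(l_max + 1)
    )


def geometric_pi(l):
    """Distribution pi_l = 2**-(l+1) of the completely random walk."""
    return 2.0 ** -(l + 1)


if __name__ == "__main__":
    for dp in (0.0, 0.05, 0.1, 0.2):
        exact = 1 / (2 + 2 * dp)
        print(dp, success_rate_eq_2_41(0.5 + dp, geometric_pi), exact)
```

As a consistency check of the formula itself, inserting $p_{AP} = 1$ with exactly one intermediate visit ($\pi_1 = 1$), as on a full cycle of the $(M+1)$-graph, gives $s = 1$.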

Numerical evidence suggests that near the point of random guessing, the
approximation $s(1/2 + \delta p, M, M_{obs}) \approx 1/2 - 2^{-(M_{obs}-M)}\,\delta p$ holds,
i.e., the prediction success is below 1/2 for all small $\delta p > 0$, but the deviation
from guessing decreases as the difference in time scales increases. As an educated
guess, one can extrapolate that as $M_{obs} - M \to \infty$, the success rate will be close
to 1/2 for all values of $p_{AP}$ except very close to $p_{AP} = 0$ and $p_{AP} = 1$, where
it will approach 1.

The error rate $1 - s(p_{AP}, M, M_{obs})$ is a measure of the antipersistence of the
time series on the scale $M_{obs}$: for $M_{obs} = M$, $1 - s$ is completely equivalent
to the antipersistence parameter $p_{AP}$ of the underlying dynamics. However, even
if $s(p_{AP}, M, M+1)$ could be calculated, it would not be possible to recursively
apply this function to find the antipersistence on the scales $M+2$, $M+3$, etc. In
other words,

$$1 - s(p_{AP}, M, M+2) \neq 1 - s(\{1 - s(p_{AP}, M, M+1)\}, M+1, M+2). \qquad (2.42)$$

It is thus not sufficient to give a single parameter $p_{AP}$, or $1 - s$, for some $M$ in
order to characterize the behavior of a time series completely and to calculate its
predictability on other scales of observation. The scale on which the dynamics
work is important as well.

Figure 2.18: Success rate of an observer keeping a table of the recent occurrence
of $M_{obs}$-bit strings and the respective following bit, for $M = 16$. $p_{AP}$ is the
probability of flipping the exit in the generating graph. The symbols labeled "theory"
were calculated using Eq. (2.41) with approximate probabilities $\pi_l(p_{AP}, M)$ taken
from the simulation itself.

2.4.6 Summary of antipersistent time series 

Every prediction algorithm is sensitive to specific features in the time series. 
For the perceptron, this feature was autocorrelations. For decision tables, it is 
fairly obvious: it is the probability that a given pattern is followed by a certain 
bit. Apparently, antipredictable sequences suppress the sensitive feature of their 
corresponding prediction algorithm: in a completely antipersistent time series 
generated by an M-bit decision table, the probability of continuing an M-bit 
pattern with 1 is 50%. 
This means that all quenched preferences for patterns vanish if antipersistence
is introduced: all patterns appear with equal probability. In the deterministic 
antipersistent time series, it is even possible to prove that all patterns appear 
exactly twice during one cycle. This is a recurring theme of antipredictable time 
series: cycles are longer than for predictable algorithms, because the state space 
of the generating algorithm becomes larger. 

I have also shown that observers that keep track of the most recent occurrence 
of Mobs-bit strings can predict the completely antipersistent cycle with 100% 
accuracy if Mobs > M, and with less than 50% success rate if Mobs < M. If 
the stochasticity is introduced by means of a probability pap of flipping the exit 
edges, the success rate even of observers with Mobs > M can drop below 50%, 
which shows that larger memory does not necessarily give better results. The 
rate of antipersistence on one scale is not sufficient to calculate the rate for other 
scales, which again shows that a time series is antipredictable only for a specific 
algorithm with specific parameters, including memory length. 
Chapter 3

The Minority Game and its variations

3.1 Introduction 

In recent years, physicists have applied methods from statistical physics and
time series analysis to problems in sociology, biology, and economics [54, 55, 56]. 
One of the important techniques was the invention and analytical treatment of 
toy models that still capture essential properties of vastly more complex real- 
world problems. The Minority Game [13], which has inspired more than 100 
publications so far, has become one of the most influential models. 

The inspiration for the Minority Game (MG) comes from the El-Farol bar 
problem brought up by W. B. Arthur [57]: a popular bar has a limited capacity 
for patrons. If fewer people attend, they have a good time. If the capacity is 
exceeded, the bar is crowded, and the potential patrons who decided to stay at 
home that night made the better choice. Arthur's hypothesis was that people have 
a number of possible prediction algorithms to decide whether to go or stay home, 
and that they pick the algorithm they use according to their individual success 
with that algorithm. As a result, attendance fluctuates around the saturation 
value. 

The underlying idea of the El-Farol scenario is competition for limited re- 
sources, and can be applied to a large number of different fields. One often-cited 
example is stock markets, where it pays off to sell if everyone else wants to buy, 
and to buy if there are many bids to sell. Other possible scenarios include alter- 
native roads between two locations, the specialization of animals on food sources 
or habitats, a university student's choice of field with a view to job prospects 
later on, and many more. 

One obvious feature of these scenarios is that no "best strategy" can exist: if 
it did, it would be the same for all players (since there are no a priori differences 
between them), every player would make the same choice, and all would lose. 
The game thus relies on coordination between the players: the aim is to find a
niche that few others occupy, and the means cannot be deduction and careful 
planning, but only inductive thinking - learning from experience and trial and 
error. 

In Arthur's paper [57], each potential patron has a repertoire of fundamentally 
different prediction algorithms, which use the time series of previous attendances 
to predict how crowded the bar will be. He monitors how well each algorithm 
predicts the actual history, and chooses the most successful one. 

This idea was simplified and formalized in the original MG by Challet and 
Zhang (see Sec. 3.3), where the threshold for overcrowding was set to half the 
number of players, and the different prediction algorithms were modeled by ran- 
domly chosen decision tables (Boolean functions). However, other rules of behav- 
ior are conceivable, like fine-tuning the parameters of one prediction algorithm 
such as a neural network (see Sec. 3.4) or changing individual entries in a deci- 
sion table if they do not seem to bring success (Sec. 3.6). Another interesting 
generalization is to allow for more than two different options, as detailed in Sec. 
3.7. Again, all sorts of strategies are conceivable in this situation. The following 
sections give an overview of the different strategies, with plenty of detail on
my contributions to the field. Where possible, I will try to point out where the 
behavior of the players leads to antipredictability in the generated time series. 



3.2 Rules of the Minority Game

In all variations of the Minority Game that will be described, the following rules
hold:

• there is an (odd) number of players $i = 1, \ldots, N$.

• at each time step $t$, each player makes a decision $\sigma_i^t \in \{+1, -1\}$ (buy/sell,
go/stay at home, take the highway/take the country road, etc.).

• the minority is determined: $S^t = -\mathrm{sign}\bigl(\sum_i \sigma_i^t\bigr)$; the players in
the minority win ($\sigma_i^t = S^t$), the others lose.

• global efficiency can be measured by

$$\sigma^2 = \left\langle \left( \sum_{i=1}^{N} \sigma_i^t \right)^2 \right\rangle_t. \qquad (3.1)$$

Random guessing would result in $\sigma^2 = N$. Smaller values indicate good
coordination, whereas larger values (often of order $N^2$) indicate herding
behavior: finite fractions of the player population correlate their decisions
and effectively act as a single agent [58].

• players cannot make contracts or otherwise communicate with each other.
The only exchange of information is through the global minority.

• consequently, players can use a window of the global history for making
their decision. In many cases, however, very similar or identical results are
achieved if the actual history is replaced by an artificial (random) history
[59].
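The rules above can be illustrated with the random-guessing baseline. The following is a minimal sketch, assuming independent fair coin flips for all players; the function names and the sample sizes are illustrative.

```python
import random


def minority(decisions):
    """Minority sign S = -sign(sum of decisions); the number of players is odd."""
    total = sum(decisions)
    return -1 if total > 0 else 1


def sigma2_random_guessing(N, steps, rng):
    """Estimate of the squared sum of decisions when every player guesses."""
    acc = 0
    for _ in range(steps):
        A = sum(rng.choice((-1, 1)) for _ in range(N))
        acc += A * A
    return acc / steps


if __name__ == "__main__":
    N = 101
    print(sigma2_random_guessing(N, 5000, random.Random(0)) / N)  # close to 1
```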

3.3 The standard Minority Game 

In a series of publications by Challet and Zhang [60, 61] and Challet, Marsili
and Zecchina [62], the following model, which became known as the "standard
Minority Game", was introduced and studied numerically and analytically:

Each player is equipped with two decision tables $A_+$ and $A_-$, each of which
prescribes an action $a_+^\mu$ or $a_-^\mu$ for each state of the world $\mu$. The state is
often taken to be the string of the last $M$ global minority decisions; e.g., if $M$
is set to 3, and in the last three steps the global decision was $-1, -1, 1$, the
current state, or "history", is $\mu = (-1, -1, 1)$. Each decision table thus has $2^M$
entries, which are determined randomly at the beginning of the game. Of course,
each player can only use one decision table at any given time, so he keeps scores
for each table and follows the table with the highest score.

The sign of the sum of all players' choices, $A^t = \sum_j \sigma_j^t$, then determines who
wins. Those in the minority ($\sigma_i^t = -\mathrm{sign}(A^t)$) gain a point, the others lose.
Players also update the scores of their decision tables: tables that would have
predicted the minority sign ($a_\pm^\mu = -\mathrm{sign}(A^t)$) gain a point, regardless of
whether this table was used or not. Tables that gave "bad advice" lose a point on
their score. Then the game is repeated, and players make choices based on the
updated state of the world and possibly use a different decision table than before.
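A compact simulation of the procedure just described might look as follows. This is a sketch under stated assumptions, not the implementation used for the figures: ties between the two table scores are broken in favor of the first table, the history is encoded as an integer with $+1 \to 1$ and $-1 \to 0$, and the names `standard_mg` and `volatility_per_player` are hypothetical.

```python
import random


def standard_mg(N, M, steps, seed=0):
    """Return the series of summed decisions A for N (odd) players."""
    rng = random.Random(seed)
    P = 1 << M  # number of histories
    # Two random decision tables per player, entries in {-1, +1}.
    tables = [[[rng.choice((-1, 1)) for _ in range(P)] for _ in range(2)]
              for _ in range(N)]
    scores = [[0, 0] for _ in range(N)]
    mu = rng.randrange(P)
    series = []
    for _ in range(steps):
        A = 0
        for i in range(N):
            best = 0 if scores[i][0] >= scores[i][1] else 1  # tie: first table
            A += tables[i][best][mu]
        S = -1 if A > 0 else 1  # minority sign
        for i in range(N):
            for k in (0, 1):
                # Virtual scoring: both tables are rated, used or not.
                scores[i][k] += 1 if tables[i][k][mu] == S else -1
        mu = ((mu << 1) | (1 if S == 1 else 0)) & (P - 1)
        series.append(A)
    return series


def volatility_per_player(series, N):
    """Estimate of sigma^2 / N from a series of A values."""
    return sum(a * a for a in series) / len(series) / N


if __name__ == "__main__":
    for M in (2, 4, 6, 8):
        A = standard_mg(101, M, 3000)
        print(M, 2 ** M / 101, volatility_per_player(A[500:], 101))
```

The usage loop prints the ratio $2^M/N$ next to the measured value of $\sigma^2/N$; the run lengths and the discarded initial segment are arbitrary choices.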

As mentioned above, success or failure of coordination can be measured with
the variance of $A$, usually referred to as $\sigma^2 = \langle A^2 \rangle$ in the literature.
Random guessing of all players leads to $\sigma^2 = N$; a value $\sigma^2/N < 1$ therefore
indicates successful coordination, whereas $\sigma^2/N > 1$ is a sign of "herd behavior":
agents adopt the same strategies that others are using as well, leading to larger
fluctuations.

Closer analysis shows that the relevant parameter in the original MG is the
ratio $\alpha = 2^M/N$ of entries in the decision table to the number of players. It is
also possible to replace the time series with a random state of the world, with $p$
possible states. The control parameter is then $\alpha = p/N$. For
$\alpha > \alpha_c \approx 0.33740$ (i.e., few players, compared to the number of fundamentally
different strategy tables available), there is fairly good cooperation
($\sigma^2/N < 1$). However, the frozen disorder in the decision tables causes some
preference in the outputs generated by the system: each pattern $\mu$ leads to
outputs $+1$ or $-1$ with probabilities $P^\mu(\pm 1) \neq 1/2$. The average deviation of
this probability from 1/2 can be quantified by the quantity $H$ defined in Eq. (3.2).

At $\alpha = \alpha_c$, the information that can be extracted from the time series in this
fashion vanishes in a second-order phase transition. Instead, dynamics become
ergodic (i.e., each pattern $\mu$ is visited with the same frequency) and increasingly
antipersistent (i.e., if a pattern results in $\mathrm{sign}(A) = +1$, the probability $p_{AP}$
of getting $-1$ on the next occurrence of $\mu$ is larger than 0.5). The interpretation is
that players overreact to occurrences of a pattern, switching to a strategy that
prescribes the opposite reaction for the next time that this pattern occurs. This
can lead to dramatic crowding, with $\sigma^2 \propto N^2$ as $\alpha \to 0$. These results are
illustrated in Fig. 3.1.

Figure 3.1: Standard minority game: results of simulations for $\sigma^2/N$, $H$ and the
antipersistence parameter $p_{AP}$ as a function of $\alpha = p/N$. $H$ is multiplied by a
constant factor for better visibility. Simulations use $N = 201$ and random patterns.

The standard MG can be solved analytically by treating the choice between
the two decision tables as a spin variable and applying the replica method to find
the ground state of the system [61, 62]. This analysis reveals that the quenched
preference for specific outputs is necessary to achieve coordination. If players
have two completely opposite decision tables instead of two random (and hence,
approximately orthogonal) tables, there is no bias to be exploited, and $\sigma^2/N$ goes
to 1 for all values of $\alpha$.
Newer studies [63, 64] have shown that the dynamics in the crowded phase
depend strongly on initial conditions: if the initial differences between the scores
of the two decision tables are drawn randomly from a sufficiently wide distribution,
only few players switch from one table to the other in response to one pattern,
and coordination becomes better.



3.4 Neural Networks in the MG

An alternative strategy for the Minority Game was first presented in Ref. [21]
(although with an erroneous calculation) and published in Refs. [65] and [66]:
each player is equipped with a simple neural network (a perceptron, to be precise)
that is fed with the vector of past minority decisions.

In a sense, the ensemble of perceptrons is an inverted committee machine: the
regular hard committee machine is a two-layer neural network where the output
is given by the majority of outputs of the perceptrons that make up the first layer
[18, 67, 68], whereas in our case the minority decides who wins. Another difference
is that a large number of training algorithms for the committee machine are not
compatible with the rules of the MG: since the "members" of the committee
machine can communicate and cooperate, it is for example possible to select the
perceptron with the smallest hidden field and modify only it - the so-called
least-action algorithm. The players in the Minority Game cannot compare notes, and
they are assumed to be greedy - foregoing an opportunity to join the winning
side in favor of a co-player is not possible in the model.

Therefore, only learning algorithms that include quantities locally available
to the neural networks are allowed. The simplest (and the only one for which an
analytical solution exists so far) is Hebbian learning of all networks.

The calculation I will now present is a correct, more complete version of
that in my Diploma thesis [21]. It is included for the sake of completeness and
because it facilitates understanding of the slightly more involved calculation for
multi-choice neural networks in Sec. 3.7.2.

3.4.1 Ensembles of vectors

As mentioned, each player in this variation of the MG has a perceptron with an
individual weight vector $\mathbf{w}_i$ to make his decision:

$$\sigma_i = \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_i). \qquad (3.3)$$

When considering a population of players, it is helpful to split the weight vectors
into a center-of-mass vector

$$\mathbf{C} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{w}_i \qquad (3.4)$$

and relative vectors

$$\mathbf{r}_i = \mathbf{w}_i - \mathbf{C}. \qquad (3.5)$$

These relative vectors are automatically anti-correlated: if the initial conditions
are chosen such that $|\mathbf{r}_i| = r = 1$, one gets from the condition
$\sum_i \mathbf{r}_i = 0$

$$\mathbf{r}_i \cdot \mathbf{r}_j = -\frac{1}{N-1} \quad \text{for } i \neq j. \tag{3.6}$$

A different norm of r can, of course, be included in the calculation by replacing
$|\mathbf{C}| = C$ by $C/r$ in the appropriate places. If the initial vectors are chosen at
random such that $\langle r \rangle = 1$, it is a fair approximation that each pair of vectors
individually is anti-correlated as in Eq. (3.6) if the dimensionality M of the space
they live in is large enough: the variance
$\langle (\mathbf{r}_i \cdot \mathbf{r}_j + 1/(N-1))^2 \rangle$ is of order 1/M.
The average scalar product between two different weight vectors is now

$$\langle \mathbf{w}_i \cdot \mathbf{w}_j \rangle = C^2 - \frac{1}{N-1}, \tag{3.7}$$

and their average norm is

$$\frac{1}{N} \sum_i \mathbf{w}_i^2 = C^2 + 1. \tag{3.8}$$

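The correlation between the output of two perceptrons on the same random input
pattern $\mathbf{x}$ is

$$\langle \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_1)\, \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_2) \rangle_{\mathbf{x}} = 1 - \frac{2}{\pi} \arccos\left( \frac{\mathbf{w}_1 \cdot \mathbf{w}_2}{w_1 w_2} \right). \tag{3.9}$$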

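With these relations, the global efficiency $\sigma^2 = \langle (\sum_j \sigma_j)^2 \rangle$ can be calculated:

$$\frac{\sigma^2}{N} = \frac{1}{N} \Big\langle \sum_{i=1}^{N} 1 + \sum_{i,\, j \neq i} \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_i)\, \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_j) \Big\rangle_{\mathbf{x}} = 1 + (N-1) \left( 1 - \frac{2}{\pi} \arccos\left( \frac{C^2 - 1/(N-1)}{C^2 + 1} \right) \right). \tag{3.10}$$

If C is set to 0 and N is large, a linear expansion of the arccos term in Eq.
(3.10) gives $\sigma^2_{\mathrm{opt}}/N \approx 1 - 2/\pi = 0.363$. The small anticorrelations (of order 1/N)
between the vectors suffice to change the prefactor in the standard deviation.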
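
The dependence of the global loss on the order parameter can be evaluated directly. The following is a minimal sketch, assuming Eq. (3.10) has the form $\sigma^2/N = 1 + (N-1)\,(1 - (2/\pi)\arccos[(C^2 - 1/(N-1))/(C^2+1)])$ with relative vectors of unit norm; the function name is a hypothetical choice made for illustration.

```python
import math

def sigma2_per_player(C, N):
    """sigma^2 / N for N symmetric perceptrons with |r_i| = 1 (assumed form of Eq. 3.10)."""
    # Cosine of the angle between two different weight vectors,
    # (C^2 - 1/(N-1)) / (C^2 + 1), following Eqs. (3.7) and (3.8).
    cos_angle = (C * C - 1.0 / (N - 1)) / (C * C + 1.0)
    # Output correlation of two perceptrons on a random pattern, Eq. (3.9).
    corr = 1.0 - 2.0 / math.pi * math.acos(cos_angle)
    return 1.0 + (N - 1) * corr

if __name__ == "__main__":
    for C in (0.0, 0.1, 1.0, 10.0):
        print(C, sigma2_per_player(C, N=101))
```

For C = 0 and large N the value approaches $1 - 2/\pi$, and for very large C it approaches N, which are the two limits discussed in the text.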

If C is much larger than r, there is a strong correlation between the perceptrons.
Most perceptrons will agree with the classification by the center of mass
$\mathrm{sign}(\mathbf{x} \cdot \mathbf{C})$. As $C \to \infty$, $\sigma^2/N$ saturates at N.

What these considerations have shown is that, assuming symmetry between
the perceptrons, the system can be described by the number N of players and
one order parameter, C (or, strictly speaking, C/r). The next section describes
what happens to that order parameter during a simple learning process.

3.4.2 Hebbian Learning

The most remarkable property of neural networks is the ability to adjust their 
parameters to unknown rules using efficient learning algorithms, and this is what 
the players in this model do: At each round of the game, each perceptron is 
trying to learn the decision of the minority according to the Hebbian learning 
rule. S denotes the minority decision, and the superscript + denotes the updated 
quantity after the learning step: 

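$$\mathbf{w}_i^{+} = \mathbf{w}_i - \frac{\eta}{M}\, \mathbf{x}\, \mathrm{sign}\Big( \sum_{j=1}^{N} \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_j) \Big) = \mathbf{w}_i + \frac{\eta}{M}\, \mathbf{x}\, S. \tag{3.11}$$

As the same correction is added to each weight vector, their mutual distances
remain unchanged. Only the center of mass is shifted, and a simple equation for
its movement can be found:

$$\mathbf{C}^{+} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{w}_i^{+} = \mathbf{C} + \frac{\eta}{M}\, \mathbf{x}\, S, \tag{3.12}$$

$$(C^{+})^2 = C^2 + \frac{2\eta}{M}\, \mathbf{x} \cdot \mathbf{C}\, S + \frac{\eta^2}{M}. \tag{3.13}$$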

To average over $\mathbf{x} \cdot \mathbf{C}\, S$ in the thermodynamic limit, it is helpful to split the hidden
fields $h_i = \mathbf{x} \cdot \mathbf{w}_i$ into contributions from the center of mass, $h^C = \mathbf{x} \cdot \mathbf{C}$, and relative
fields $h_i^r = \mathbf{x} \cdot \mathbf{r}_i$. For a given $h^C$, one can then average over $\mathbf{x}$:

$$\mathbf{x} \cdot \mathbf{C}\, S = -|h^C|\, \mathrm{sign}\Big( \sum_{i=1}^{N} \mathrm{sign}(h^C)\, \mathrm{sign}(h_i^r + h^C) \Big). \tag{3.14}$$

The quantity $\mathrm{sign}(h^C)\, \mathrm{sign}(h_i^r + h^C)$ is a random variable with mean $\mathrm{erf}(|h^C|/\sqrt{2})$
and variance $1 - \mathrm{erf}(|h^C|/\sqrt{2})^2$. In a linear approximation for small $|h^C|$, I can
replace this by mean $\sqrt{2/\pi}\, |h^C|$ and variance 1. It becomes clear in Eq. (3.15)
why this linear approximation is admissible: in the range where it is no longer
valid, the effect of $h^C$ has already saturated.

For sufficiently large N, one can use the central limit theorem [69] to show
that the sum $\sum_i \mathrm{sign}(h^C)\, \mathrm{sign}(h_i^r + h^C)$ becomes a Gaussian random variable with mean
$\sqrt{2/\pi}\, N |h^C|$. Since the terms of the sum in (3.14) are anticorrelated rather than
independent, the variance turns out to be $(1 - 2/\pi) N$ rather than N, analogously
to Eq. (3.10).¹ This yields

$$\Big\langle \mathrm{sign}\Big( \sum_{i=1}^{N} \mathrm{sign}(h^C)\, \mathrm{sign}(h_i^r + h^C) \Big) \Big\rangle = \mathrm{erf}\left( \sqrt{N/(\pi - 2)}\; |h^C| \right). \tag{3.15}$$

¹ This was the source of confusion in Ref. [21]: the anticorrelation was not taken into account
properly. In a guessed fit to simulations, the factor $\pi - 2$ that appears in the calculation was
then replaced by 1.

Figure 3.2: Fixed point of C vs. $\eta$: simulations with M = 100 agree well with
Eq. (3.17). The limit for $N \to \infty$ is $C = \sqrt{2\pi}\, \eta/4$.

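Since $h^C$ is a Gaussian variable with mean 0 and variance $C^2$, the average over
$h^C$ can now be evaluated. The following differential equation for the norm of
the center of mass results, with time measured in units of M learning steps
($\alpha = t/M$):

$$\frac{dC}{d\alpha} = \frac{\eta^2}{2C} - \frac{2\eta}{\sqrt{\pi}}\, \frac{\sqrt{N}\, C}{\sqrt{\pi - 2 + 2 N C^2}}. \tag{3.16}$$

The fixed point of C, which can be plugged into Eq. (3.10) to get $\sigma^2/N$ as a
function of $\eta$ and N, is

$$C = \frac{\sqrt{\pi}\, \eta}{4} \left[ 1 + \sqrt{1 + \frac{16 (\pi - 2)}{\pi N \eta^2}} \right]^{1/2} \tag{3.17}$$

(see Figs. 3.2 and 3.3).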

For small C, $h^C$ has only a small influence on the decision of the majority,
which may then not coincide with $\mathrm{sign}(\mathbf{x} \cdot \mathbf{C})$. If the majority disagrees with
$\mathrm{sign}(h^C)$, the learning step has a positive overlap with $\mathbf{C}$. This leads to
$C \propto \sqrt{\eta}$ as $\eta \to 0$.

If C is large, the majority of perceptrons will usually make the same decision
as $\mathbf{C}$, which then behaves like the single confused perceptron: $C \to \sqrt{2\pi}\, \eta/4$ if
$N \eta^2 \to \infty$ (compare to Eq. (2.7)).
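
The two limits of the fixed point can be checked numerically. The sketch below assumes that averaging Eq. (3.13) with Eq. (3.15) over Gaussian $h^C$ gives a mean drift of $C^2$ equal to $\eta^2 - (4\eta/\sqrt{\pi})\,\sqrt{N} C^2/\sqrt{\pi - 2 + 2NC^2}$ per unit of rescaled time, and that the fixed point is $C = (\sqrt{\pi}\,\eta/4)\,[1 + \sqrt{1 + 16(\pi-2)/(\pi N \eta^2)}]^{1/2}$. Both expressions are reconstructions, and the function names are hypothetical.

```python
import math

def drift_C2(C, eta, N):
    """Assumed mean change of C^2 per unit rescaled time (from Eqs. 3.13 and 3.15)."""
    # <|h| erf(a |h|)> for Gaussian h with variance C^2 and a^2 = N / (pi - 2)
    overlap = (2.0 / math.sqrt(math.pi)) * math.sqrt(N) * C * C \
        / math.sqrt(math.pi - 2.0 + 2.0 * N * C * C)
    return eta * eta - 2.0 * eta * overlap

def fixed_point_C(eta, N):
    """Closed-form root of drift_C2 (assumed form of Eq. 3.17)."""
    inner = 1.0 + 16.0 * (math.pi - 2.0) / (math.pi * N * eta * eta)
    return math.sqrt(math.pi) * eta / 4.0 * math.sqrt(1.0 + math.sqrt(inner))

if __name__ == "__main__":
    for eta in (0.01, 0.1, 1.0, 10.0):
        C = fixed_point_C(eta, N=51)
        print(eta, C, drift_C2(C, eta, 51))
```

Quadrupling a small learning rate should roughly double C (the $\sqrt{\eta}$ regime), whereas for $N\eta^2 \gg 1$ the ratio $C/\eta$ should approach $\sqrt{2\pi}/4$.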

Figure 3.3: Fixed point of $\sigma^2/N$ vs. $\eta$: the combination of Eqs. (3.10) and (3.17)
shows that sufficiently small learning rates lead to $\sigma^2/N < 1$.

The last points are important, since they relate the Minority Game to
antipredictability. They are therefore worth a second look. The probability that
the majority agrees with the output of the center of mass can be calculated easily.
The most important step has already been done, in Eq. (3.15). This can be
integrated over the distribution of $h^C$ to get

$$\mathrm{Prob}\Big( \mathrm{sign}\big( \textstyle\sum_i \sigma_i \big) = \mathrm{sign}(\mathbf{x} \cdot \mathbf{C}) \Big) = \frac{1}{2} + \frac{1}{\pi} \arctan\left( \sqrt{\frac{2N}{\pi - 2}}\; C \right). \tag{3.18}$$

This result is compared to simulations in Fig. 3.4. It shows the crossover of the 
time series from essentially random behavior to a high probability of agreeing 
with the center of mass. In that limit, the ensemble of players could almost be 
replaced by a single effective player (represented by a confused perceptron) for 
the purpose of time series generation.
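
The crossover can be tabulated with a few lines of code. This is a sketch under the assumption that Eq. (3.18) reads $\mathrm{Prob} = 1/2 + (1/\pi)\arctan(\sqrt{2N/(\pi-2)}\,C)$, which is what integrating Eq. (3.15) over a Gaussian $h^C$ of variance $C^2$ gives; the function name is hypothetical.

```python
import math

def prob_majority_follows_center(C, N):
    """Probability that the majority output equals sign(x.C), assumed form of Eq. (3.18)."""
    return 0.5 + math.atan(math.sqrt(2.0 * N / (math.pi - 2.0)) * C) / math.pi

if __name__ == "__main__":
    for C in (0.0, 0.05, 0.2, 1.0, 5.0):
        print(C, prob_majority_follows_center(C, N=51))
```

At C = 0 the majority is uncorrelated with the center of mass (probability 1/2); for large C the probability approaches 1.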

In conclusion, neural networks that learn the output of the minority with
Hebbian learning can coordinate well in the Minority Game, reducing the measure
of global loss $\sigma^2/N$ to values as low as 0.363. However, if learning rates are not
small enough, learning causes a positive overlap between the agents, leading to
herding behavior and large global losses. The transition between anti-correlated
and correlated behavior is a smooth crossover, not a phase transition.

Figure 3.4: Probability that the center of mass determines the decision of the
majority: comparison between Eq. (3.18) and simulations with M = 100. Agreement
is excellent for N = 51, but not so good for N = 21.

3.5 Johnson's evolutionary MG 

N. F. Johnson and coworkers introduced another simple set of rules in Ref. [70]:
each agent has access to a history table that records the outcome that followed
each possible history pattern $\mu$ upon the last occurrence of $\mu$. Also, each agent
i has an individual probability $p_i$. In each round, each agent looks at the entry
in the history table corresponding to the current pattern $\mu_t$, and chooses this
entry with probability $p_i$; otherwise, he chooses the opposite of the last "winning
option".

The learning algorithm in Johnson's scenario is evolutionary: agents who
are in the minority gain one point, whereas those in the majority lose one. If
an agent's score falls below a certain threshold d < 0, the agent's strategy is
modified, i.e., $p_i$ is changed. The distribution of the $p_i$ converges to a stationary
state where extreme probabilities $p \approx 0$ and $p \approx 1$ are much more likely than
intermediate probabilities, as seen in Fig. 3.5.

As Burgos and Ceva pointed out in Ref. [71], the history table is essentially
unnecessary in this variation of the game: all players use their individual $p_i$ for
each of the possible histories, and there is complete symmetry between +1 and
-1. If one is only interested in the distribution of the $p_i$ and not in the time series
generated by the game, one can therefore examine an equivalent game in which
each player chooses +1 with probability $p_i$ and -1 with probability $1 - p_i$. One analytical
approach was given in Ref. [72]; another theory that models pairs of players
whose decisions cancel each other with a given probability was proposed in Ref.
[73]. Temporal oscillations were taken into account in a newer study, Ref. [74].

Figure 3.5: Stationary distribution of switch probabilities p in Johnson's
evolutionary MG. Simulations used N = 1001 and a threshold d = -10; however,
results hardly depend on these parameters if N is large enough.

3.6 Reents' and Metzler's stochastic MG 

Similar in spirit is the strategy suggested by Reents, Metzler and Kinzel in Ref.
[75]: players like to stick to the option they chose in the previous round. If a
player wins, he has no motivation to change anything, and will choose the same
action again. If a player i loses, he will still choose the same action with a
probability $1 - p_i$ (after all, times could get better again). However, there is a
probability $p_i$ for the player to decide that things have to change, in which case
he chooses the opposite action.

Since each player only remembers his current state, the probability of finding 
the system at a given state at t + 1 only depends on the state of the players at t 
and the parameters pi and N. The problem is thus a one-step Markov process, 
and the standard treatment is to calculate the probabilities of finding the game 
in each possible state.

If each player has the same probability p, the game is fairly easy to describe
mathematically: it is unnecessary to keep track of the whole set of $\{\sigma_i\}_{i=1}^{N}$;
instead, one can consider

$$K(t) = \frac{1}{2} \sum_i \sigma_i(t). \tag{3.19}$$

The possible values k that K(t) can take are half-integer and run from -N/2
to N/2 in steps of 1. This choice of variables may seem awkward, but it is the
most convenient. The other possibility, $K = \sum_i \sigma_i$, results in a large number of
factors of 1/2 in the calculation. Furthermore, one player switching sides results
in a change in K of one unit, which is easy to remember.
The probabilities to be calculated are

$$\pi_k(t) = \mathrm{prob}(K(t) = k), \tag{3.20}$$

and the dynamics is defined by the transition probabilities

$$W_{k\ell} = \mathrm{prob}(K(t+1) = k \mid K(t) = \ell). \tag{3.21}$$

To shorten notation we consider the probabilities $\pi_k(t)$ as components of the state
vector $\pi(t) = (\pi_{-N/2}(t), \ldots, \pi_{N/2}(t))^T$. The number of players in the majority at
time t is $N/2 + |K(t)|$. Since the individual players perform independent Bernoulli
trials, the transition probability $W_{k\ell} = W(\ell \to k)$ from a state with $K(t) = \ell$ to
$K(t+1) = k$ is given by the binomial distribution:

$$W_{k\ell} = \begin{cases} \binom{N/2 + \ell}{\ell - k}\, p^{\ell - k} (1-p)^{N/2 + k} & \text{for } \ell > 0, \\[4pt] \binom{N/2 - \ell}{k - \ell}\, p^{k - \ell} (1-p)^{N/2 - k} & \text{for } \ell < 0. \end{cases} \tag{3.22}$$

This stochastic process may be considered a random walk in one dimension, where
steps of arbitrary size with probability (3.22) are allowed only in the direction of
the origin K = 0.

Given the initial state $\pi(0)$, the state $\pi(t)$ is updated at each time step by
multiplying it by the transition matrix W:

$$\pi(t+1) = W \pi(t). \tag{3.23}$$

The mathematical theory dealing with this kind of problem is that of Markov
chains with stationary transition probabilities [53]. Since $(W^2)_{k\ell} > 0$, the chain
is irreducible as well as ergodic [76], which implies that regardless of the initial
distribution the state $\pi(t)$ converges for $t \to \infty$ to a unique stationary state
$\pi(\infty) = \pi^s$. In view of Eq. (3.23), $\pi^s$ corresponds to an eigenvector of W with
eigenvalue 1:

$$W \pi^s = \pi^s \quad \text{and} \quad \sum_k \pi^s_k = 1. \tag{3.24}$$

This eigenvector, which characterizes the equilibrium state of the system, can, in
principle, be calculated analytically for any N. However, the results are generally 
rational functions in p whose degree is roughly - they become rather awkward 
for larger N, and the computational effort to find them is intolerable. Analytical 
solutions in a more agreeable form can be found in the two limiting cases of small 
p ($p = 2x/N$, $N \to \infty$) and large p ($p = \mathrm{const}$, $pN \gg 1$).

In intermediate regimes, while analytical results are hard to come by, numerical
solutions are easy to obtain. The problem can be simplified by exploiting the
symmetry $W_{-k,-\ell} = W_{k\ell}$, which implies the symmetry $\pi^s_{-k} = \pi^s_k$ of the stationary
state. By rewriting the eigenvalue problem for the independent components of
$\pi^s$ and exploiting the knowledge that the largest eigenvalue is 1, the stationary
state is obtained as the solution of a linear equation with N/2 variables. It can be
calculated numerically up to $N \approx 1200$ in reasonable time with standard linear algebra
packages.
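
As an illustration of the numerical route, the sketch below builds the transition matrix for small odd N and finds the stationary state by damped iteration instead of the linear solve described above (a deliberate simplification, not the method of the text). It assumes Eq. (3.22) is the binomial law for the number of majority players who switch; states are indexed by j, the number of players on side +1, so that K = j - N/2. All function names are hypothetical.

```python
import math

def transition_matrix(N, p):
    """W[j_new][j_old], with j the number of players on side +1 (N odd)."""
    W = [[0.0] * (N + 1) for _ in range(N + 1)]
    for j_old in range(N + 1):
        plus_is_majority = 2 * j_old > N
        majority = j_old if plus_is_majority else N - j_old
        for n in range(majority + 1):  # n losing players switch sides
            prob = math.comb(majority, n) * p**n * (1.0 - p) ** (majority - n)
            j_new = j_old - n if plus_is_majority else j_old + n
            W[j_new][j_old] = prob
    return W

def stationary_state(W, sweeps=2000):
    """Iterate pi <- (pi + W pi)/2; the averaging damps the period-2 oscillation."""
    size = len(W)
    pi = [1.0 / size] * size
    for _ in range(sweeps):
        new = [sum(W[k][l] * pi[l] for l in range(size)) for k in range(size)]
        pi = [0.5 * (a + b) for a, b in zip(pi, new)]
    return pi

if __name__ == "__main__":
    N, p = 11, 0.3
    pi = stationary_state(transition_matrix(N, p))
    variance = sum((2 * j - N) ** 2 * w for j, w in enumerate(pi))  # <(2K)^2>
    print(pi, variance)
```

The averaged update has the same fixed point as Eq. (3.23) but converges even when the majority flips sides at almost every step.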

3.6.1 Solution for small p 

If p is so small that only O(1) players switch sides, it is fairly intuitive that
the stationary state is localized within O(1) of the origin; i.e., only states with
|K| = O(1) have an appreciable probability. This suspicion is confirmed by the
analytical calculation in the limit $p = 2x/N$, $N \to \infty$. The following approximation
is good as long as $x \ll N$.

In this case the matrix elements $W_{k\ell}$ can be approximated by Poisson
probabilities [53]:

$$W_{k\ell} \approx W^P_{k\ell} = \begin{cases} e^{-x}\, \dfrac{x^{\ell - k}}{(\ell - k)!} & \text{for } \ell > 0, \\[8pt] e^{-x}\, \dfrac{x^{k - \ell}}{(k - \ell)!} & \text{for } \ell < 0, \end{cases} \tag{3.25}$$

where, again, 1/m! for negative m has to be interpreted as zero. In the limit
$N \to \infty$ the solution of the problem is an infinite component vector $\pi^s$ satisfying
the eigenvalue equation together with the proper normalization:

$$W^P \pi^s = \pi^s \quad \text{and} \quad \sum_k \pi^s_k = 1. \tag{3.26}$$

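This problem was solved by G. Reents, who noticed that the moments of the
stationary distribution follow simple rules:

$$\left\langle |k| - \frac{1}{2} \right\rangle = \frac{x}{2}, \tag{3.27}$$

etc.

These in turn determine the characteristic function of $\pi^s_k$, and a Fourier transform
finally leads to

$$\pi^s_k = \frac{1}{2x} \left[ 1 - e^{-x} \sum_{j=0}^{|k| - 1/2} \frac{x^j}{j!} \right]. \tag{3.28}$$

Ch. Horn proved in a rather lengthy calculation that Eq. (3.28) indeed satisfies
the eigenvalue equation (3.26) [73].

$\pi^s_k$ can also be expressed by the incomplete gamma function:

$$\pi^s_k = \frac{1}{2x} \left[ 1 - \frac{\Gamma(|k| + 1/2,\, x)}{\Gamma(|k| + 1/2)} \right]. \tag{3.29}$$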
A comparison with numerically determined eigenvectors of the matrix (3.22) for
N = 801 gives excellent agreement, as seen in Fig. 3.6. The distribution is roughly
flat for small |k|, has a turning point near |k| = x and falls off exponentially with
k for larger values of |k|. From Eq. (3.28), the variance $\sigma^2 = \langle (2k)^2 \rangle$ can be
calculated:

$$\sigma^2 = 1 + 4x + \frac{4}{3} x^2. \tag{3.30}$$

For small x, this approaches the optimal value $\sigma^2 = 1$ that occurs if the majority
is always as narrow as possible, but even for larger x, $\sigma^2$ does not increase with N.
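
The small-p solution can be cross-checked numerically. The sketch below assumes that the stationary law is $\pi^s_k = (1/2x)\,\mathrm{Prob}(n > |k| - 1/2)$ with n Poisson-distributed with mean x (equivalent to one minus the truncated exponential series), applies the Poisson kernel of Eq. (3.25) once, and compares the variance with $1 + 4x + 4x^2/3$. The folding onto $m = |k| - 1/2$ and all function names are choices made here for illustration.

```python
import math

def poisson_pmf(x, n_max):
    pmf = [math.exp(-x)]
    for n in range(1, n_max + 1):
        pmf.append(pmf[-1] * x / n)
    return pmf

def candidate_stationary(x, m_max):
    """P(m) = Prob(Poisson(x) > m) / x for m = |k| - 1/2; pi_k = P(m)/2 per sign of k."""
    pmf = poisson_pmf(x, 4 * m_max)
    tail = [0.0] * (len(pmf) + 1)
    for n in range(len(pmf) - 1, -1, -1):
        tail[n] = tail[n + 1] + pmf[n]  # tail[n] = Prob(Poisson >= n)
    return [tail[m + 1] / x for m in range(m_max + 1)]

def one_step(P, x):
    """Apply the Poisson kernel once, folded onto m = |k| - 1/2."""
    m_max = len(P) - 1
    pmf = poisson_pmf(x, 2 * m_max + 1)
    new = [0.0] * (m_max + 1)
    for m, weight in enumerate(P):
        for n, q in enumerate(pmf):  # n majority players switch sides
            target = m - n if n <= m else n - m - 1  # majority flips if n > m
            if target <= m_max:
                new[target] += weight * q
    return new

if __name__ == "__main__":
    x = 2.0
    P = candidate_stationary(x, 40)
    Q = one_step(P, x)
    print(max(abs(a - b) for a, b in zip(P, Q)))
    print(sum((2 * m + 1) ** 2 * w for m, w in enumerate(P)), 1 + 4 * x + 4 * x * x / 3)
```

The truncation at m = 40 is harmless for moderate x because the tail probabilities fall off faster than exponentially.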



3.6.2 Solution for large p 

As mentioned before, an (approximate) analytical solution can also be found if p is
of order 1 and $pN \gg 1$. To handle this regime, I introduce a rescaled (continuous)
coordinate $\kappa = k/N = \sum_i \sigma_i/(2N)$, the range of which is $-1/2 < \kappa < 1/2$.
Multiplied by N, the stationary state $\pi^s_k$ for large N turns into a probability
density function $\pi^s(\kappa)$, and the matrix $W_{k\ell}$ becomes an integral kernel $W(\kappa, \lambda)$;
consequently, the eigenvalue equation Eq. (3.24) is transformed into an integral
equation:

$$\pi^s(\kappa) = \int W(\kappa, \lambda)\, \pi^s(\lambda)\, d\lambda \quad \text{and} \quad \int \pi^s(\kappa)\, d\kappa = 1. \tag{3.31}$$

Figure 3.6: Stationary solution $\pi^s_k$ for $p = x/(N/2)$. The numerical solution for
N = 801 (symbols) is in very good agreement with the analytical solution for
$N \to \infty$.



Numerical calculations show that the eigenvector $\pi^s(\kappa)$ takes the shape of two
Gaussian peaks centered at symmetrical distances $\pm\kappa_0$ from the origin (see Fig.
3.7). The physical interpretation is that the majority switches from one side to
the other in every time step. Since approximately $(\kappa_0 + 1/2)\, pN$ agents switch
sides every turn and the distance between the two peaks amounts to a number
of $2\kappa_0 N$ agents, we get $\kappa_0 = p/(4 - 2p)$.

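Figure 3.7: Stationary solution $\pi^s(\kappa)$ for p = 0.4. With increasing N, the peaks
become narrower.

This reasoning can be made more precise, and the width of the peaks for
large but finite N can also be calculated, by the following argument: The well-known
normal approximation for the binomial coefficients in Eq. (3.22) leads to

$$W(\kappa, \lambda) \approx N W_{k\ell} \approx \frac{1}{\sqrt{2\pi}\, s(\lambda)} \exp\left[ -\frac{(\kappa - f(\lambda))^2}{2 s^2(\lambda)} \right], \tag{3.32}$$

where $f(\lambda) = (1-p)\lambda - \mathrm{sign}(\lambda)\, \frac{p}{2}$
and $s^2(\lambda) = \frac{p (1-p)(1 + 2|\lambda|)}{2N}$.

A double Gaussian of the form

$$\pi^s(\kappa) = \frac{1}{2\sqrt{2\pi}\, b} \left\{ \exp\left[ -\frac{(\kappa + \kappa_0)^2}{2 b^2} \right] + \exp\left[ -\frac{(\kappa - \kappa_0)^2}{2 b^2} \right] \right\} \tag{3.33}$$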

is transformed by the integral kernel (3.32) into a double peak of the same type 
under two approximations: 

First, one must approximate $s^2(\lambda)$ of (3.32) by $s^2(\pm\kappa_0)$ in the integral
equation. This is justified if the peaks are narrow, i.e., $b^2 \ll 1$.

Second, one must assume that the peaks are well localized on each side of
0, i.e., the tail of the Gaussian on the positive side does not have a significant
contribution in the negative range and vice versa. The mathematical condition
for this is $b \ll \kappa_0$. Also, the peaks should be well separated from -1/2 and
+1/2. If these conditions are fulfilled, one can extend the integral over each peak
from $-\infty$ to $+\infty$.

By requiring $\pi^s(\kappa)$ from Eq. (3.33) to satisfy the eigenvalue equation (3.31),
one finds

$$\kappa_0 = \frac{p}{2(2-p)}, \qquad b^2 = \frac{1-p}{N (2-p)^2}. \tag{3.34}$$

The result for $\kappa_0$ confirms the simple argument given above, whereas the term
for $b^2$ is slightly surprising: it does not depend on p in the leading order, i.e., it
is not simply the number of players who switch sides. Instead, the width of the
peaks is the result of two conflicting mechanisms: the variance in the number
of players who switch sides makes the peak wider. It is counteracted by a
self-focusing mechanism: for example, if the system is at $\kappa = \kappa_0 + h$ at time t, it will
on average be at $-\kappa_0 + (1-p)h$ at the next time step, i.e., the average distance
from the center $-\kappa_0$ has decreased by a factor (1 - p).

Eq. (3.34) also allows one to check whether the assumptions made for its
derivation are true for a given p and N, i.e., whether the approximation is self-consistent.
For example, for $p = x/N$, $\kappa_0^2/b^2 \to 0$ for $N \to \infty$ according to (3.34), so one
cannot expect the formation of double peaks in this limit. The crossover from
single-peak to double-peak distribution occurs for $p \propto 1/\sqrt{N}$.

If the conditions are fulfilled, it is easy to integrate over the probability
distribution Eq. (3.33) to get an expression for $\sigma^2$:

$$\sigma^2 = \frac{N}{(2-p)^2} \left( N p^2 + 4(1-p) \right). \tag{3.35}$$



Figure 3.8: $\sigma^2$ for several values of p and N, compared to predictions by Eq.
(3.35).

This agrees well with numerics if the condition $\kappa_0 \gg b$ is fulfilled, i.e., for
sufficiently large p and N, as seen in Fig. 3.8. Numerical evidence shows that
$\sigma^2$ scales like $N^2 p^2$ even when the condition $\kappa_0 \gg b$ is not fulfilled, as for
$p \propto 1/\sqrt{N}$. In this case we found $\sigma^2 \propto N$, like in the original game of Challet
and Zhang and for random guessing. However, the behavior of the system is
different: the distribution of $\kappa$ can have a double peak structure (depending on
the proportionality constant) rather than a Gaussian shape, and the minority is
still very likely to switch from one side to the other at consecutive time steps.
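
The large-p results are easy to evaluate and to test for self-consistency. The sketch below assumes the double-peak parameters $\kappa_0 = p/(2(2-p))$ and $b^2 = (1-p)/(N(2-p)^2)$ and the variance $\sigma^2 = N(Np^2 + 4(1-p))/(2-p)^2$; the separation threshold `ratio` is a hypothetical choice, since the text only requires $\kappa_0 \gg b$.

```python
import math

def peak_position(p):
    """kappa_0, the rescaled distance of each peak from the origin."""
    return p / (2.0 * (2.0 - p))

def peak_variance(p, N):
    """b^2, the variance of each Gaussian peak in the rescaled coordinate."""
    return (1.0 - p) / (N * (2.0 - p) ** 2)

def sigma2_large_p(p, N):
    """<(2K)^2> for well separated peaks (assumed form of Eq. 3.35)."""
    return N / (2.0 - p) ** 2 * (N * p * p + 4.0 * (1.0 - p))

def peaks_are_separated(p, N, ratio=3.0):
    """Crude self-consistency check kappa_0 > ratio * b (ratio is an arbitrary choice)."""
    return peak_position(p) > ratio * math.sqrt(peak_variance(p, N))

if __name__ == "__main__":
    for p, N in ((0.4, 801), (0.4, 51), (1.0 / 801, 801)):
        print(p, N, sigma2_large_p(p, N), peaks_are_separated(p, N))
```

The variance is simply $4N^2(\kappa_0^2 + b^2)$, the second moment of the double Gaussian expressed in units of single players.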

3.6.3 Mixed strategy populations - susceptibility to noise

In Ch. Horn's diploma thesis [73], computer experiments were presented in
which populations with different strategies (decision tables, neural networks, the
stochastic strategy etc.) played in a common minority game, and the success of
the strategies depending on the fraction of players that use them was studied.
It turned out that in most cases, each of the three mentioned strategies is
robust to the presence of a small number of players with other strategies (meaning
that the coordination is not completely disrupted by the presence of "strangers")
and that populations that have a large majority of the total population usually
score better than the small population using another strategy: the strategies are
evolutionarily stable [77].

The susceptibility of a population of players using the stochastic strategy to
the presence of other agents can be calculated, provided that the output of these
other agents is not strongly correlated to the dynamics of the stochastic players;
in other words, the collective output of the alternative-strategy agents can be
treated as a random number.

To be more precise, let us assume that among a total of N players, a
population of $N_S$ players using the described stochastic strategy competes with
$N_R = N - N_S$ players who are simply guessing randomly. The output of the guessers
can be treated as a random variable $K_R$, analogous to Eq. (3.19), with a
probability distribution

$$\mathrm{Prob}(K_R = k) = \frac{1}{2^{N_R}} \binom{N_R}{N_R/2 + k}. \tag{3.36}$$

These guessers only influence the stochastic players' coordination if they overrule
their decision, that is, if $\mathrm{sign}(K_S + K_R) \neq \mathrm{sign}(K_S)$. The probability of that is

$$P_o(K_S) = \frac{1}{2}\, \mathrm{Prob}(|K_R| > |K_S|) = \sum_{k > |K_S|} 2^{-N_R} \binom{N_R}{N_R/2 + k}. \tag{3.37}$$
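
The overruling probability is a finite binomial sum and can be evaluated exactly. This is a minimal sketch, assuming $P_o(K_S) = \frac{1}{2}\mathrm{Prob}(|K_R| > |K_S|)$ with $K_R$ distributed as in Eq. (3.36); exact fractions are used so that half-integer and integer values of K are handled without rounding. The function name is hypothetical.

```python
import math
from fractions import Fraction

def overrule_probability(K_S, N_R):
    """P_o(K_S) = (1/2) * Prob(|K_R| > |K_S|) for N_R players who guess at random."""
    half = Fraction(N_R, 2)
    total = Fraction(0)
    for j in range(N_R + 1):  # j guessers choose +1, so K_R = j - N_R/2
        K_R = j - half
        if abs(K_R) > abs(K_S):
            total += Fraction(math.comb(N_R, j), 2**N_R)
    return total / 2

if __name__ == "__main__":
    for twice_K in (1, 3, 5, 7):  # K_S = 1/2, 3/2, ...
        print(Fraction(twice_K, 2), overrule_probability(Fraction(twice_K, 2), N_R=10))
```

The factor 1/2 accounts for the guessers' excess pointing against the stochastic majority in only half of the cases.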

If the decision is overruled, the stochastic players who are on the minority side
among their population roll the dice to see whether they change their opinion.
In terms of a random walk, there is now a probability $P_o(k)$ that the step from
position k leads away from the origin rather than towards it. The entries of the
transition matrix have to be changed accordingly:

$$W_{k\ell} = \begin{cases} \binom{N_S/2 + \ell}{\ell - k} p^{\ell - k} (1-p)^{N_S/2 + k} \left(1 - P_o(\ell)\right) + \binom{N_S/2 - \ell}{k - \ell} p^{k - \ell} (1-p)^{N_S/2 - k} P_o(\ell) & \text{for } \ell > 0, \\[6pt] \binom{N_S/2 - \ell}{k - \ell} p^{k - \ell} (1-p)^{N_S/2 - k} \left(1 - P_o(\ell)\right) + \binom{N_S/2 + \ell}{\ell - k} p^{\ell - k} (1-p)^{N_S/2 + k} P_o(\ell) & \text{for } \ell < 0. \end{cases} \tag{3.38}$$

The stationary eigenvector of this matrix can again be calculated numerically. The
agreement with simulations confirms the approach (see Fig. 3.9). In the limit where
$p = 2x/N$ and $N_S \to \infty$, the distribution depends on the absolute number, not
the ratio, of random players.




Figure 3.9: Distribution of the output of $N_S$ stochastic players in the presence of
$N_R = 101 - N_S$ random players. Simulations are taken from Ref. [73], numerical
results are calculated using Eq. (3.38).

Although the width of the distribution of $K_S$ increases slightly from the
presence of guessers, this drawback does not have a drastic influence on the average
gain (or rather, loss) of the stochastic players: if the guessers overrule the decision
made by the stochastic players, that means that the majority of those wins, and
they actually increased their population's gain at the expense of the guessers.
This argument can easily be quantified: upon averaging the gain using the
calculated probability of $K_S$, the gain is $+2|K_S|$ with probability $P_o(K_S)$ and $-2|K_S|$
otherwise:

$$\langle g_S \rangle = \sum_{K_S} \pi^s_{K_S} \left[ 2|K_S|\, P_o(K_S) - 2|K_S| \left(1 - P_o(K_S)\right) \right]. \tag{3.39}$$

Likewise, the gain of the random players can be calculated using the known
probability distributions for $K_S$ and $K_R$. Agreement with simulations is excellent,
as seen in Fig. 3.10.

As pointed out in [73], the guessers would be better off in a pure population
of guessers, but worse off if the stochastic players were simply absent, and the
$N_R$ guessers made up the whole population. The average loss of isolated random
players is easy to calculate:

$$\langle g_R \rangle = -2 \sum_{k=-N_R/2}^{N_R/2} |k|\; 2^{-N_R} \binom{N_R}{N_R/2 + k}, \tag{3.40}$$

and it is worse than that of the mixed population (see Fig. 3.10).

Figure 3.10: Two populations of $N_S$ stochastic players with p = 1 and $N_R =
101 - N_S$ random players compete. The numerical results agree excellently with
simulations taken from Ref. [73].



3.6.4 Correlations in the time series 

It was mentioned in Sec. 3.6.2 that in the case of fixed p and large N, the
majority switches sides at every time step. On the other hand, it is obvious that
for $p \ll 2/N$ no player will change his opinion during most time steps, leaving
the majority side unchanged. What happens between these extremes?

To answer this question, one can calculate the one-step autocorrelation
function $\langle S(t) S(t+1) \rangle$ of the minority sign S(t) from the stationary probability
distribution and the transition matrix:

$$\langle S(t) S(t+1) \rangle = \sum_{k,\ell} \mathrm{sign}(k)\, \mathrm{sign}(\ell)\, \pi^s_\ell\, W_{k\ell}. \tag{3.41}$$
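
The double sum can be evaluated for small systems as follows. This is a sketch, not the author's code: it assumes the binomial transition matrix of Eq. (3.22), indexes states by the number j of players on side +1 (so that sign(K) = sign(2j - N) for odd N), and obtains the stationary state by a damped iteration. All function names are hypothetical.

```python
import math

def transition_matrix(N, p):
    """W[j_new][j_old], with j the number of players on side +1 (N odd)."""
    W = [[0.0] * (N + 1) for _ in range(N + 1)]
    for j_old in range(N + 1):
        plus_is_majority = 2 * j_old > N
        majority = j_old if plus_is_majority else N - j_old
        for n in range(majority + 1):  # n losing players switch sides
            prob = math.comb(majority, n) * p**n * (1.0 - p) ** (majority - n)
            W[j_old - n if plus_is_majority else j_old + n][j_old] = prob
    return W

def stationary_state(W, sweeps):
    size = len(W)
    pi = [1.0 / size] * size
    for _ in range(sweeps):  # averaged update damps the period-2 oscillation
        new = [sum(W[k][l] * pi[l] for l in range(size)) for k in range(size)]
        pi = [0.5 * (a + b) for a, b in zip(pi, new)]
    return pi

def one_step_autocorrelation(N, p, sweeps=1500):
    """<S(t) S(t+1)> from the stationary state and the transition matrix."""
    W = transition_matrix(N, p)
    pi = stationary_state(W, sweeps)
    sign = lambda j: 1 if 2 * j > N else -1
    return sum(sign(k) * sign(l) * W[k][l] * pi[l]
               for k in range(N + 1) for l in range(N + 1))

if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3, 0.5):
        print(p, one_step_autocorrelation(11, p))
```

Since the minority sign is minus the majority sign at both times, the product of signs is the same whichever of the two is used.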



Alternatively, one can simply measure the autocorrelation in a simulation. Figs.
3.11 and 3.12 show the autocorrelation function in the two limits treated in Sec.
3.6.1 and 3.6.2, respectively. In the limit of $p = 2x/N$, the correlation is positive
for small x, as expected: if fewer than one player changes sides on average, the
majority is not likely to switch. For $x \approx 1.1$, $\langle S(t) S(t+1) \rangle = 0$. Beyond that
point, the autocorrelation function is negative, and it goes to -1 as $x \to \infty$.

Figure 3.11: One-step autocorrelation of minority signs generated by the stochastic
minority game in the limit $p = 2x/N$, $N \to \infty$.

For fixed p, $\langle S(t) S(t+1) \rangle$ goes to -1 rapidly with increasing N. The
interpretation in terms of the probability distribution is that an appreciable overlap
between the two Gaussian peaks in $\pi^s$ corresponds to a non-vanishing probability
that the minority sign stays the same. As N increases, the peaks get narrower,
and the overlap vanishes.

3.6.5 Relation to the original MG 

The stochastic MG as it was presented here has only a one-step memory. It is
possible (although not necessary) to include a longer history of minority decisions,
analogous to the early papers on the evolutionary Minority Game [70]. In that
case every player would keep an individual decision table that tells him what to
do if a given sequence of minority decisions occurs, similar to the tables that the
players in the original Minority Game use. However, instead of changing to an
entirely different table, each player changes individual entries in his table with
probability p if he loses. It is easy to see (and a similar argument was given
in Ref. [71] for the global decision table in Johnson's variant) that the entries
for different histories are completely decoupled. Each entry or row in the table
corresponds to a one-step Markov process as described above, influenced only by
the last time that the same history occurred.

Figure 3.12: For fixed p and increasing N, $\langle S(t) S(t+1) \rangle$ goes to zero as the
overlap between the peaks in $\pi^s$ decreases. See also Fig. 3.7.

In that sense, introducing a history changes the properties of the time series 
generated by the decisions, but not the average loss of the players in the stationary 
state. It might become relevant if one mixes players with different strategies and 
different memory length, who are susceptible to correlations on different orders 
of the DeBruijn graph (see Sec. 2.4.4 and 2.4.5). As Sec. 3.6.4 showed, there is a 
crossover from persistent to antipersistent behavior as the switching probability 
is increased. If p is constant and N is large enough, the ensemble of players can
essentially be replaced by one effective player, who generates a time series with 
the properties studied in Sec. 2.4. 

The presented strategy shows similarities to the behavior of the original
minority game in the limit $\alpha \to 0$, i.e., a small ratio of possible histories compared
to the number of players [64]. In the extreme case where each of the two decision
tables that each player keeps consists of only one entry (i.e., only the last
minority decision counts), the output of roughly 50% of the players is set to either +1
or -1 independent of the history, whereas the other 50% can choose their output
depending on their current score. Out of those players, those who chose the
minority side will repeat their decision, whereas the update of the scores will cause
some of the losers to switch sides. As mentioned before, it has been observed that
$\sigma^2$ shows a crossover from $\sigma^2 \propto N^2$ to $\sigma^2 \propto 1$ depending on the initial differences
in players' scores [63, 64]; these differences determine the typical number of
players who switch sides when they lose. This situation is very much the same for
players who use perceptrons with only one input unit [66], where a solution with
$\sigma^2 = 1$ is reached if the learning rate $\eta$ is smaller than the differences between
the initial weights of the players. However, in the absence of frozen disorder, the
stochastic strategy obviously does not lead to a separation of agents into frozen
and oscillating players.

3.7 Multi-choice MGs 

A generalization of the MG that was suggested by I. Kanter and studied first
by Ein-Dor, Metzler, Kinzel and Kanter in Ref. [78] is to lift the restriction
to two alternatives and allow for Q choices, or options, or "rooms", instead.
The choice that is picked by the fewest players wins; in case of a tie, one of the 
rooms with the lowest attendance is picked at random and declared the winning 
option. The motivation is clear, since many real-world problems offer more than 
two choices: an investor has many different stocks to choose from, more than 
two routes may connect two cities, and a predator may have more than two 
possible hunting grounds. In a sense, the generalization from the binary MG to 
a multi-choice MG is analogous to the generalization from an Ising model [79] to 
a Potts model [80]. However, this analogy does not apply to the mathematical 
treatment of the standard MG using Ising spin variables [61] - there, the spin 
variables represented the choice between the two decision tables rather than the 
two actions. 

3.7.1 Random Guessing 

The default strategy against which any more sophisticated approaches have to be 
measured is that of randomly choosing one of the options. It is not as obvious as in 
the binary MG what numbers come out, or even what quantities one should look 
at. I will therefore devote a few words to that topic. The following calculations 
were largely done by Liat Ein-Dor. 

The macrostate of the game can be described by the set of numbers $\{N_q\}$ of
players who choose each option q = 1, ..., Q. However, the only quantity relevant
for global gain or loss is the number $N_{\min} = \min_q(\{N_q\})$ of players in the winning
room. In analogy to the binary MG, one can define a variance

$$\sigma^2_{\min} = \langle (N_{\min} - N/Q)^2 \rangle. \tag{3.42}$$

Analogous to the probability distribution of $N_+$ in the binary MG, which, for
random guessing, follows a binomial distribution, the joint probability distribution
of the $N_q$ is given by the multinomial distribution:

$$P(\{N_q\}) = \frac{N!}{\prod_q N_q!}\; Q^{-N}\; \delta_{\sum_q N_q,\, N}. \tag{3.43}$$

Averages over $N_{\min}$ can be evaluated by exploiting the symmetry of the system:
I assume that $N_1$ is the smallest occupation number (which happens with a
probability of 1/Q), include a factor of Q to correct this assumption, and sum over
the other $N_q$ with appropriate boundaries. For example, $\sigma^2_{\min}$ can be calculated
as follows:

$$\sigma^2_{\min} = Q \sum_{N_1 = 0}^{N} \left( N_1 - \frac{N}{Q} \right)^2 \sum_{N_2, N_3, \ldots = N_1}^{N} P(\{N_q\}). \tag{3.44}$$
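Numerics show that for large N, $\sigma^2_{\min}$ is proportional to N, as expected. To
calculate the proportionality constant for $N \to \infty$, one can introduce new, properly
scaled quantities to measure the deviations from equidistribution:

$$N_q = N/Q + \epsilon_q \sqrt{N}, \tag{3.45}$$

and approximate the factorials in Eq. (3.43) using the Stirling approximation

$$N! \approx \sqrt{2\pi N} \left( \frac{N}{e} \right)^N. \tag{3.46}$$

The result

$$P(\{\epsilon_q\}) \propto \exp\left( -\frac{Q}{2} \sum_q \epsilon_q^2 \right) \delta\Big( \sum_q \epsilon_q \Big) \tag{3.47}$$

can be used to numerically integrate the continuous expression for $\langle \epsilon^2_{\min} \rangle$:

$$\langle \epsilon^2_{\min} \rangle = \frac{\int_{-\infty}^{\infty} d\epsilon_1\, \epsilon_1^2 \left( \prod_{q \geq 2} \int_{\epsilon_1}^{\infty} d\epsilon_q \right) P(\{\epsilon_q\})}{\int_{-\infty}^{\infty} d\epsilon_1 \left( \prod_{q \geq 2} \int_{\epsilon_1}^{\infty} d\epsilon_q \right) P(\{\epsilon_q\})}. \tag{3.48}$$

Structurally, this is an integral over the square of the smallest of Q Gaussian
random numbers. The only difficulty is the global constraint for the sum of the $\epsilon_q$.
Results of the numerical integration for Q = 3, 4, 5 and 6 are $\sigma^2_{\min}/N = \langle \epsilon^2_{\min} \rangle \approx$
0.313, 0.322, 0.320, and 0.309, respectively.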
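
The constrained Gaussian average can also be estimated by simple sampling instead of numerical integration. The sketch below assumes the weight $P(\{\epsilon_q\}) \propto \exp(-\frac{Q}{2}\sum_q \epsilon_q^2)\,\delta(\sum_q \epsilon_q)$; it draws Q independent standard normal numbers, removes their mean to enforce the constraint, and rescales by $1/\sqrt{Q}$. The function name, sample size and seed are arbitrary choices.

```python
import math
import random

def eps_min_squared(Q, samples=20000, seed=0):
    """Monte Carlo estimate of <eps_min^2> for Q equivalent options."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(Q)]
        mean = sum(z) / Q
        # eps_q = (z_q - mean) / sqrt(Q) is Gaussian with sum_q eps_q = 0 and
        # weight exp(-(Q/2) * sum_q eps_q^2) on the constraint surface.
        eps_min = (min(z) - mean) / math.sqrt(Q)
        total += eps_min * eps_min
    return total / samples

if __name__ == "__main__":
    for Q in (2, 3, 4, 5, 6):
        print(Q, eps_min_squared(Q))
```

For Q = 2 the estimate should be close to 1/4, which reproduces $\sigma^2 = N$ of the binary game because the difference of the two occupation numbers is twice the deviation of the smaller one.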

Another quantity that is easier to calculate and often just as meaningful as
If there is no systematic preference for one of the options, it is sufficient to look
at one of the options, e.g. the option 1. For random guessing, this quantity takes 
the form 



i=l 



Q 

= ^ (^) ■ ^'-''^ 

The form of the expression in Eq. (3.50) will be useful later on, since it reduces 
the calculation of to the average probability that two agents agree on their 
output. 



3.7.2 Neural Networks 
Network architecture 

The first scenario for the multi-choice minority game that was studied in detail [78] included an ensemble of neural networks making their choice based on an M-dimensional vector x whose components are the most recent minority decisions. Since both the input and the output of the networks are now integer numbers in the range between 1 and Q, the architecture of the networks has to be able to handle this. A simplified version of a Potts perceptron [81], which is introduced in greater detail in Chapter 4, seems appropriate. In this simplification, each player has a weight vector $w_i$, from which Q hidden fields are calculated. Each hidden field $h_{iq}$ only gets contributions from components of the weight vector if the corresponding component of the input vector is equal to q:

$$h_{iq} = \sum_{j=1}^{M} w_{ij}\, \delta_{x_j,q}. \qquad (3.52)$$

This architecture is limited compared to the full Potts perceptron: first, there is no interaction between the different input values - the appearance or non-appearance of some $q_1$ in the pattern does not influence the hidden field for any other q. Second, the options are completely interchangeable: if one would interchange all occurrences of $q_1$ and $q_2$ in the pattern, one would interchange $h_{q_1}$ and $h_{q_2}$ as well. This seems reasonable, given that the options are completely equivalent a priori.

Each agent determines his output $\sigma_i$ by choosing the option that corresponds to his largest hidden field:

$$\sigma_i = \{k \mid h_{ik} = \max_q h_{iq}\}. \qquad (3.53)$$

From these individual outputs, the occupation numbers $N_q = \sum_i \delta_{\sigma_i,q}$ are calculated, and the winning option S is determined:

$$S = \{q \mid N_q = \min_m N_m\}. \qquad (3.54)$$

If the learning rule is Hebbian, analogous to Sec. 3.4, an analytical approach is possible. The update rule for the vectors is therefore

$$w_{ij}^{+} = w_{ij} + \frac{\eta}{M}\left(Q\,\delta_{x_j,S} - 1\right). \qquad (3.55)$$

The rule does what one would expect: it increases the hidden field for option 
S and decreases the fields for the other options. Also, the same update step is 
applied to all weights, leaving relative vectors unchanged. 
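One round of this game can be sketched as follows. The code assumes the hidden field of option q sums the weights at positions where the input equals q, and that every player receives the same step $(\eta/M)(Q\,\delta_{x_j,S} - 1)$. The function names and the tie-breaking choices (lowest index for a player's output, a random pick for the minority room) are assumptions made for illustration only.

```python
import random

def hidden_fields(w, x, Q):
    """h[q-1] sums the weights w[j] at the positions where x[j] == q."""
    h = [0.0] * Q
    for wj, xj in zip(w, x):
        h[xj - 1] += wj
    return h

def choose(w, x, Q):
    """Option with the largest hidden field (ties: lowest index, an assumption)."""
    h = hidden_fields(w, x, Q)
    return 1 + max(range(Q), key=lambda q: h[q])

def minority(outputs, Q, rng):
    """Least occupied option; ties are broken at random (an assumption)."""
    counts = [outputs.count(q) for q in range(1, Q + 1)]
    least = min(counts)
    return rng.choice([q for q in range(1, Q + 1) if counts[q - 1] == least])

def hebbian_update(weights, x, S, Q, eta):
    """Add (eta/M) * (Q * [x_j == S] - 1) to component j of every weight vector."""
    M = len(x)
    for w in weights:
        for j in range(M):
            w[j] += (eta / M) * (Q * (x[j] == S) - 1)

def play_round(weights, x, Q, eta, rng):
    """All players decide, the minority option wins, all players learn."""
    outputs = [choose(w, x, Q) for w in weights]
    S = minority(outputs, Q, rng)
    hebbian_update(weights, x, S, Q, eta)
    return S

if __name__ == "__main__":
    rng = random.Random(0)
    Q, M, N = 3, 12, 7
    weights = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]
    pattern = [rng.randint(1, Q) for _ in range(M)]
    print("winning option:", play_round(weights, pattern, Q, 0.1, rng))
```

Because the same increment is added to every player, differences between weight vectors are untouched by `hebbian_update`; only the common part of the weights moves.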

Just like in Sec. 3.4, weight vectors can be split into a center of mass $\mathbf{C} = \sum_i \mathbf{w}_i/N$ and relative vectors $\mathbf{r}_i = \mathbf{w}_i - \mathbf{C}$. The initial conditions are such that the relative vectors have a norm of $|\mathbf{r}_i| = 1$ and symmetrical overlaps of $\mathbf{r}_i \cdot \mathbf{r}_j = -1/(N-1)$ (the first condition is a choice of length scale, the second is automatically fulfilled approximately if vectors are chosen randomly with a large number M of dimensions). If one sets $C = |\mathbf{C}| = 0$, this is indeed the optimal configuration that can be achieved without breaking the symmetry between the perceptrons.

For random patterns and large N, the hidden fields $h_{iq}$ are Gaussian random numbers. It is convenient to rescale them to variables $h^i_q$ which obey $\langle h^i_q h^i_r \rangle = \delta_{q,r}$ and $\langle h^i_q h^j_r \rangle = R\,\delta_{q,r}$ for $i \neq j$, where $R = \langle \mathbf{w}_i \cdot \mathbf{w}_j / (|\mathbf{w}_i||\mathbf{w}_j|) \rangle_{ij}$ is the overlap between different weight vectors.

The interesting task is to calculate $\sigma^2_{\min}$ for non-vanishing R, i.e., for correlated weight vectors. This is apparently only feasible for small correlations $R = O(1/N)$, and even there only in a rather convoluted way. The following calculation was basically done by L. Ein-Dor and I. Kanter. However, its presentation in Ref. [78] is somewhat cryptic and slightly wrong, and I will attempt to give an understandable and correct account of it.

Calculating $\sigma^2$ and $\sigma^2_{\min}$

The first step is to calculate the variance $\sigma^2$, which can be broken down to an average over two players, as pointed out in Sec. 3.7.1. The question is thus: what is the probability that two perceptrons i and j with a small overlap R give the same output, let us say, $\sigma_i = \sigma_j = 1$? Equivalently, what is the probability that $h^i_1 > h^i_q \wedge h^j_1 > h^j_q$ for all $q > 1$? The hidden fields can be combined into one vector $\mathbf{h} = (h^i_1, \ldots, h^i_Q, h^j_1, \ldots, h^j_Q)$ of Gaussian variables with the correlation matrix

$$\mathbf{C} = \begin{pmatrix} \mathbb{1} & R\,\mathbb{1} \\ R\,\mathbb{1} & \mathbb{1} \end{pmatrix}, \qquad (3.56)$$

where $\mathbb{1}$ denotes the $Q \times Q$ identity matrix.



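The joint probability distribution of these variables follows

$$P(\mathbf{h}) = \frac{1}{\sqrt{(2\pi)^{2Q} \det \mathbf{C}}} \exp\left(-\frac{1}{2}\, \mathbf{h}^T \mathbf{C}^{-1} \mathbf{h}\right). \qquad (3.57)$$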



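For general R and Q, this expression is fairly complicated. We simplify it by approximating to the first order in R. It turns out that

$$\mathbf{C}^{-1} \approx \begin{pmatrix} \mathbb{1} & -R\,\mathbb{1} \\ -R\,\mathbb{1} & \mathbb{1} \end{pmatrix}, \qquad (3.58)$$

with $\mathbb{1}$ the $Q \times Q$ identity matrix, which can be verified by checking that $\mathbf{C} \cdot \mathbf{C}^{-1}$ yields the identity matrix multiplied with $1 - R^2$.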
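The check mentioned here is easy to carry out numerically. The sketch below assumes the correlation matrix consists of identity blocks on the diagonal and R times the identity off the diagonal, and that the approximate inverse has the same structure with R replaced by -R; the helper names `block_matrix` and `matmul` are hypothetical.

```python
def block_matrix(Q, off):
    """2Q x 2Q matrix: identity on the diagonal blocks, off * identity on the others."""
    n = 2 * Q
    return [[1.0 if a == b else (off if abs(a - b) == Q else 0.0)
             for b in range(n)] for a in range(n)]

def matmul(A, B):
    """Plain matrix product for small square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[a][k] * B[k][b] for k in range(n)) for b in range(n)]
            for a in range(n)]

if __name__ == "__main__":
    Q, R = 3, 0.05
    product = matmul(block_matrix(Q, R), block_matrix(Q, -R))
    # Every diagonal entry equals 1 - R^2, every off-diagonal entry vanishes.
    print(product[0][0], 1 - R * R, product[0][Q])
```

The product is exactly $(1 - R^2)$ times the identity, so the error of the approximate inverse is of second order in R.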

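In that approximation, the probability distribution from Eq. (3.57) looks more amiable:

$$P(\mathbf{h}) \approx \frac{1}{(2\pi)^Q} \exp\left(-\frac{1}{2} \sum_{q=1}^{Q} \left[(h^i_q)^2 + (h^j_q)^2\right]\right) \left(1 + R \sum_{q=1}^{Q} h^i_q h^j_q\right). \qquad (3.59)$$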



Now the probability that both networks give an output of 1 can be calculated, again in a consistent approximation to first order in R:

$$\mathrm{Prob}(\sigma_i = \sigma_j = 1) = \int_{-\infty}^{\infty} dh^i_1 \int_{-\infty}^{\infty} dh^j_1 \int_{-\infty}^{h^i_1} \prod_{q=2}^{Q} dh^i_q \int_{-\infty}^{h^j_1} \prod_{q=2}^{Q} dh^j_q \; P(\mathbf{h})$$

$$\approx \frac{1}{Q^2} + R \left( \int_{-\infty}^{\infty} \frac{dh}{\sqrt{2\pi}}\, h\, \exp(-h^2/2)\, \Phi(h)^{Q-1} \right)^2 + R\,(Q-1) \left( \int_{-\infty}^{\infty} \frac{dh}{2\pi}\, \exp(-h^2)\, \Phi(h)^{Q-2} \right)^2. \qquad (3.60)$$



The first term in Eq. (3.60) is the default result for uncorrelated vectors. The second term is due to the correlations between $h^i_1$ and $h^j_1$, and the third term represents correlations between $h^i_q$ and $h^j_q$ for $q \ge 2$. For some reason, Ref. [78] only lists the first and third term.

Using this result, $\sigma^2$ can be calculated according to Eq. (3.50):

$$\sigma^2(R)/N = \frac{Q-1}{Q^2} + (N-1)\, R\, I, \qquad (3.61)$$

where $I$ stands for the two integral terms multiplying R in Eq. (3.60). Plugging in the optimal value $R = -1/(N-1)$, one gets an optimal $\sigma^2/N$ which is independent of N, just like in the binary-choice MG.

Figure 3.13: Variance $\sigma^2/N$ as a function of R in the multi-choice minority game with neural networks. Eq. (3.61) agrees well with simulations using M = 200. All simulations in this section use random patterns.

Considering that Eq. (3.61) is a first-order approximation, it agrees surpris- 
ingly well with simulations, as seen in Fig. 3.13. 

What is still left to do is to deduce $\sigma^2_{\min}$ from $\sigma^2$. This can be done in a surprisingly easy way: for large N, the occupation numbers $N_q$ are still Gaussian random numbers whose variance is now given by Eq. (3.61) instead of (3.51). Therefore, the integral necessary to calculate $\sigma^2_{\min}/N = \langle \epsilon^2_{\min} \rangle$ has the exact form of Eq. (3.48), and if $\epsilon$ is rescaled properly, one can reuse the integrals evaluated to get $\sigma^2_{\min}$ in the random case:

$$\frac{\sigma^2(R)}{\sigma^2_{\min}(R)} = \frac{\sigma^2(0)}{\sigma^2_{\min}(0)}. \qquad (3.62)$$

The values on the right hand side can be taken from Eqs. (3.51) and (3.48), while the numerator on the left hand side is given by Eq. (3.61).

Just how good is perfect coordination ($R = -1/(N-1)$) compared to random guessing? As the calculations have shown, this depends only on Q. The ratio $\sigma^2(R = -1/(N-1))/\sigma^2(R = 0)$ is shown in Fig. 3.14. For Q = 2, one gets the ratio $1 - 2/\pi$, as calculated in Sec. 3.4. For larger Q, the ratio slowly converges to 1: the benefit from coordination decreases with Q.

Figure 3.14: Improvement by coordination: $\sigma^2(R = -1/(N-1))/\sigma^2(R = 0)$ as calculated from Eq. (3.61). For larger Q, the benefit gained from learning decreases.
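The ratio plotted in Fig. 3.14 can be evaluated with a short numerical sketch. It rests on two assumptions that should be checked against the original equations: that $I$ is the sum of the squares of the integrals $\int \frac{dh}{\sqrt{2\pi}}\, h\, e^{-h^2/2}\, \Phi(h)^{Q-1}$ and (weighted by $Q-1$) $\int \frac{dh}{2\pi}\, e^{-h^2}\, \Phi(h)^{Q-2}$, and that $\Phi$ is the cumulative distribution function of the standard Gaussian. With $\sigma^2/N = (Q-1)/Q^2 + (N-1) R I$ and $R = -1/(N-1)$, the ratio becomes $1 - Q^2 I/(Q-1)$. All function names are hypothetical.

```python
import math

def phi(h):
    """Standard normal density."""
    return math.exp(-0.5 * h * h) / math.sqrt(2.0 * math.pi)

def Phi(h):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(h / math.sqrt(2.0)))

def simpson(f, a, b, n=4000):
    """Composite Simpson rule with an even number n of intervals."""
    step = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * step)
    return total * step / 3.0

def correlation_integral(Q):
    """I(Q): sum of the two squared integrals that multiply R (assumed form)."""
    A = simpson(lambda h: h * phi(h) * Phi(h) ** (Q - 1), -10.0, 10.0)
    B = simpson(lambda h: phi(h) ** 2 * Phi(h) ** (Q - 2), -10.0, 10.0)
    return A * A + (Q - 1) * B * B

def coordination_ratio(Q):
    """sigma^2 at R = -1/(N-1) divided by sigma^2 at R = 0."""
    return 1.0 - Q * Q * correlation_integral(Q) / (Q - 1)

if __name__ == "__main__":
    for Q in range(2, 8):
        print(Q, round(coordination_ratio(Q), 4))
```

For Q = 2 the sketch reproduces $1 - 2/\pi$, which supports the assumed form of $I$; for larger Q the ratio grows towards 1, in line with the decreasing benefit of coordination.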



Learning Dynamics 

The preceding paragraphs showed how $\sigma^2$ depends on R, and that R depends on the norm of the center-of-mass vector C. To finish the argument, one can show that the proposed learning rule leads to $C \to 0$, and thus optimal coordination, for $\eta \to 0$. The calculation, which is again by L. Ein-Dor, is roughly analogous to that for the binary-choice MG, but takes a shortcut at a convenient spot.

To find C in dependence on $\eta$, K, and Q, one starts with the update rule for C, which is analogous to Eq. (3.55):

$$C_j^{+} = C_j + \frac{\eta}{M}\left(Q\,\delta_{x_j,S} - 1\right), \qquad (3.63)$$

takes the square of it, averages over random patterns, and sums over j:

For $M \to \infty$, this turns into a differential equation for C, where one infinitesimal time step $d\alpha = 1/M$ corresponds to one learning step:

$$\frac{dC^2}{d\alpha} = 2\eta Q \left\langle \sum_j C_j\, \delta_{x_j,S} \right\rangle + \eta^2 (K-1). \qquad (3.65)$$

At this point, Ref. [78] makes a convenient approximation: it is assumed that the output of the ensemble of networks always follows the output of the center-of-mass. While possibly true for large N and large $\eta$, this is certainly not correct for small $\eta$ - the range we are interested in! However, I will first try to explain the approximation, and then follow it to the end to see if it makes a difference in the final result.

The hidden fields $h_{iq}$ of each player can again be split into two contributions, one from the random vector and one from the center-of-mass:

$$h_{iq} = \sum_{m=1}^{M} \delta_{x_m,q}\, r_{im} + \sum_{m=1}^{M} \delta_{x_m,q}\, C_m. \qquad (3.66)$$

The fields induced by the center-of-mass, $h^C_q = \sum_m \delta_{x_m,q}\, C_m$, thus act as a bias on the random decision of the relative vectors. If C is of order 1, the center-of-mass heavily influences the decision of each individual. However, to influence the decision of the minority, it only has to tip the decision of $O(\sqrt{N}/Q)$ players.

It is messy to calculate the probability that the output of a single player follows the smallest hidden field of the center-of-mass, not to mention the probability that the ensemble joins this decision. The probability can be measured in simulations, though, and results are shown in Fig. 3.15 for Q = 4. As expected, the probability that the room with the largest occupation corresponds to the largest $h^C_q$ goes to 1 as the overlap between weight vectors increases. However, the probability that the smallest occupation is given by the smallest hidden field takes high values for intermediate $\cos(\theta)$ and then decreases again. The reason for this is not yet clear.

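If one nevertheless assumes that the option with the smallest $h^C_q$ is the minority decision S of the ensemble, $\sum_j C_j\, \delta_{x_j,S}$ is simply the smallest of Q uncorrelated Gaussian random numbers with zero mean and variance $C^2/Q$, and the average over it can be calculated. Again, let us assume that the winning option is 1, and correct the assumption by a factor of Q:

$$\left\langle \sum_j C_j\, \delta_{x_j,S} \right\rangle = Q \int_{-\infty}^{\infty} \frac{dh_1}{\sqrt{2\pi C^2/Q}}\, h_1 \exp\left(-\frac{Q h_1^2}{2C^2}\right) \prod_{q \ge 2} \int_{h_1}^{\infty} \frac{dh_q}{\sqrt{2\pi C^2/Q}} \exp\left(-\frac{Q h_q^2}{2C^2}\right)$$

$$= C\sqrt{Q} \int_{-\infty}^{\infty} \frac{dh}{\sqrt{2\pi}}\, h\, \exp(-h^2/2)\, (1 - \Phi(h))^{Q-1}. \qquad (3.67)$$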

This is proportional to C, and together with Eq. (3.65) one gets for the fixed 
point 

where J is the integral expression in Eq. (3.67).

Figure 3.15: Perceptrons in the multi-choice MG: probabilities that the least occupied room $S_{\min}$ corresponds to the option $S_{C,\min}$ with the smallest bias $h^C_q$, that the most occupied room $S_{\max}$ corresponds to the largest bias $S_{C,\max}$, and that the individual decisions $\sigma_i$ correspond to the respective largest and smallest bias, as a function of the weight vectors' mutual angle $\cos(\theta)$. Simulations used Q = 4, N = 41 and M = 100.

As mentioned before, this approximation is never truly valid. It is the same approximation that replaces the center-of-mass vector in the binary-choice perceptron with a single confused perceptron (see Sec. 3.4.2), but the existence of more than two alternatives complicates the situation.

Nevertheless, simulations indicate that C indeed goes to as ^ oo for small 
A*", although not linearly. This is again analogous to the binary-choice-perceptron, 
where C oc was found for small rj. It does not change the conclusion that 
takes its optimal value for very small learning rates. 

3.7.3 Decision tables 

The possibility of applying other rules of behavior (such as Challet and Zhang's decision tables) to the multi-choice minority game was only hinted at in Ref. [78]. I will present a straightforward generalization of the standard MG and some simulations which indicate that the behavior of the multi-choice MG is very similar to the two-choice MG. A more thorough analysis based on a slightly modified generalization was presented by Chow and Chau in Ref. [82].

The simplest generalization from the standard MG [60, 61, 62] is to give each player two decision tables with entries in $\{1, \ldots, Q\}$ for each of the possible histories $\mu$. Scoring does not have to be modified: a table receives a point if it would have predicted the correct minority room, and loses a point for an incorrect prediction. The table with the highest score is used.

Since the number of table entries increases rather drastically with Q, it simplifies simulations to replace the time series/history scheme by a number p of possible patterns, one of which is picked at random at each time step. In the conventional MG, this alters results only very little [59].
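A minimal simulation of this generalization might look as follows. It is a sketch under stated assumptions rather than the code used for the figures: rooms are labeled 0 to Q-1, each player holds two random tables, ties between table scores favor the first table, ties between rooms are broken at random, and $\sigma^2$ is averaged over all rooms and time steps. The function name `simulate_table_mg` and its parameters are hypothetical.

```python
import random

def simulate_table_mg(N, Q, P, steps, seed=0):
    """Return sigma^2 / N for a multi-choice MG with two decision tables per player."""
    rng = random.Random(seed)
    # tables[i][s][mu]: room proposed by table s of player i for pattern mu
    tables = [[[rng.randrange(Q) for _ in range(P)] for _ in range(2)]
              for _ in range(N)]
    scores = [[0, 0] for _ in range(N)]
    total = 0.0
    for _ in range(steps):
        mu = rng.randrange(P)                      # random pattern instead of history
        counts = [0] * Q
        for i in range(N):
            s = 0 if scores[i][0] >= scores[i][1] else 1
            counts[tables[i][s][mu]] += 1
        least = min(counts)
        winner = rng.choice([q for q in range(Q) if counts[q] == least])
        for i in range(N):
            for s in range(2):                     # score both tables, used or not
                scores[i][s] += 1 if tables[i][s][mu] == winner else -1
        total += sum((c - N / Q) ** 2 for c in counts) / Q
    return total / (steps * N)

if __name__ == "__main__":
    print(simulate_table_mg(N=31, Q=3, P=300, steps=3000))
```

With many patterns per player (large $\alpha = p/N$) the result should stay near the random-guessing value $(Q-1)/Q^2$; shrinking P at fixed N moves the system towards the crowded regime.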

Results can be seen in Fig. 3.16: curves for $\sigma^2/N$ look qualitatively similar for Q = 2 (the binary case, apart from a factor of 4 that comes from different definitions of $\sigma^2$), Q = 3 and Q = 4. For $\alpha \to \infty$, the values of $\sigma^2$ approach those of random guessing ($\sigma^2/N = (Q-1)/Q^2$), whereas at small $\alpha$ there is a crowded phase.

As in the case of Q = 2, a preference for a certain response to each pattern $\mu$ appears at $\alpha_c$ and increases with $\alpha$. This preference can be expressed as an information H, which is defined in terms of the probability that, following a given pattern $\mu$, the minority chooses option q:



The phase transition can be located by estimating where H becomes 0 for $N \to \infty$. As seen in Fig. 3.16, the critical value $\alpha_c$ of the phase transition decreases with Q.

3.7.4 Johnson's evolutionary MG 

The evolutionary Minority Game suggested in Ref. [70] and briefly described in Sec. 3.5 has several possible generalizations to Q options. In the first, players have a probability p to choose the option that was successful the last time. Otherwise, they choose one of the remaining options at random. In the second, players i have a set of probabilities $\{p_i^q\}$ to visit each of the possible rooms $q = 1, \ldots, Q$. If their score drops below a certain threshold, they discard their probabilities and choose a new combination of $p_i^q$.

Let us first consider the first generalization. It is fairly obvious that this prescription can do very little to improve coordination. Even in the extreme case where exactly N/Q players choose the last winning option with probability 1 and $N(1 - 1/Q)$ roll the dice with probability 1, the standard deviation is that of $N(1 - 1/Q)$ players choosing between Q - 1 options. The best thing that can happen is that the number of guessers is effectively reduced by a factor of $1 - 1/Q$.

Figure 3.16: Standard Minority Game with $Q \ge 2$ options. Curves for $\sigma^2/N$ and H look qualitatively similar, whereas the value $\alpha_c$ of the phase transition depends on Q. Simulations use N = 101 and N = 201.

Simulations show that, compared to the case of Q = 2, the self-organization is less pronounced for Q > 2. The probability density distribution $P(p_i)$ decreases monotonically towards higher $p_i$; there is no double-peak structure (see Fig. 3.17). There is, in the stationary state, a chance of 1/Q for the most recent option to win again, i.e., probabilities $p_i$ distribute themselves to arbitrage away any systematic advantage:



Furthermore, coordination is slightly improved, compared to completely random 
guessing. 

What is the cause of the relatively bad coordination in this evolutionary game? As pointed out in Ref. [73], the stationary distribution $P(p_i)$ is determined by two processes: the removal of agents with a given $p_i$, i.e., the mean lifetime of an agent with $p_i$ in an environment with a given distribution $P(p_i)$, and the addition of new agents, i.e., the rules according to which the $p_i$ of removed players are replaced. In a dynamics in which new $p_i$ are chosen at random, the stationary distribution $P(p_i)$ is proportional to the mean lifetime $L(p_i)$ of a player with probability $p_i$. In an environment with a wide distribution $P(p_i)$, having an extreme position apparently does not provide enough advantage to bring about a strong polarization of opinions.

This can be remedied by a small change in the dynamics: a removed player is replaced by a (possibly slightly modified) copy of one of the other players. This ensures, in the spirit of evolutionary biology, that "fitter" strategies spread more quickly than less successful ones. In simulations, it proves essential to add a small "mutation" (a random number) to the value of p that is being copied, to allow the population to reach advantageous p-values even if the agents that held them initially died out due to fluctuations. In Fig. 3.17, this mutation is a Gaussian random number with mean 0 and $\sigma = 0.005$ that is added to $p_i$ with reflecting boundary conditions.

Figure 3.17: Johnson's Evolutionary Minority Game with Q = 3 options and one probability $p_i$ of choosing the last winning option: if new players pick their $p_i$ at random, specialization is weak (solid line). If they imitate an existing player, the population segregates into two crowds (dotted line).
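The replacement step with imitation and mutation can be sketched in a few lines. The helper names `reflect` and `imitate` are hypothetical, and the choice of the donor (uniformly among all other players) is an assumption; the text only states that the copied value receives a small Gaussian mutation with reflecting boundaries at 0 and 1.

```python
import random

def reflect(x):
    """Fold x back into [0, 1] by reflecting it at the boundaries."""
    while x < 0.0 or x > 1.0:
        x = -x if x < 0.0 else 2.0 - x
    return x

def imitate(population, dead, sigma, rng):
    """Replace player `dead` by a mutated copy of another, randomly chosen player."""
    donors = [k for k in range(len(population)) if k != dead]
    p = population[rng.choice(donors)]
    population[dead] = reflect(p + rng.gauss(0.0, sigma))

if __name__ == "__main__":
    rng = random.Random(0)
    population = [rng.random() for _ in range(10)]
    imitate(population, dead=3, sigma=0.005, rng=rng)
    print(population[3])
```

Without the mutation, a value of p that has died out can never reappear; the small noise term keeps the whole interval reachable while leaving successful values almost unchanged.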

For Q = 2 (binary decision), this leads to a split of the population into two groups: one with $p \approx 0$ and roughly N/2 members, the other with $p \approx 1$ and again N/2 members. This even leads to $\sigma^2 = O(1)$. One can argue that this situation is analogous to Reents' and Metzler's stochastic strategy, with evaluation of one's strategy stretched out over a longer time: let us exploit the symmetry of the problem again and say that $p_i$ is the probability of choosing +1. If the population is already separated into two peaks at $p \approx 1$ and $p \approx 0$, these correspond to different populations that say +1 or -1, respectively. Players who have lost a certain amount pick a new $p_i$ - either they rejoin their previous side, or they switch sides. If few enough players die per round, this can lead to $\sigma^2 \approx 1$.

For Q > 2, the population again segregates into two factions with extreme opinions ($p_i \approx 0$ or 1). As explained above, even this split cannot do more than reduce the number of guessers, and hence $\sigma^2/N$, by a factor of $(Q-1)/Q$. Numerical evidence shows that this is exactly what happens.






The second generalization, in which each player i has an individual preference for each room q, offers (at least in theory) a configuration of perfect coordination: for each option, there is a fraction of 1/Q players who choose it with certainty. Analogous to the case of binary decisions, where probabilities close to 0 or 1 are favorable, the preferred probability combinations would be in the corners of the simplex. A suitable order parameter is therefore the average self-overlap of agents' strategy vectors:

$$R = \left\langle \sum_{q=1}^{Q} (p_i^q)^2 \right\rangle_i, \qquad (3.71)$$

which takes values from 1/Q if each option is equally likely, to 1 if there is complete specialization. The average of this quantity can be determined analytically for random vectors. The simplest way to generate a random vector on a simplex is to take Q random numbers $r_q$, equidistributed between 0 and 1, and normalize them: $p_k = r_k / \sum_q r_q$. The average of R is then

$$R_Q = \int_0^1 \prod_{q=1}^{Q} dr_q\; \frac{\sum_k r_k^2}{\left(\sum_q r_q\right)^2}. \qquad (3.72)$$

This yields $R_3 = 2 + 12\log(2) - 9\log(3) \approx 0.43026$, $R_4 = 2 - 88\log(2) + 54\log(3) \approx 0.32812$ etc. These results, which can be well approximated by $R_Q \approx 1.31/Q$, are the baseline against which a coordination by evolution has to be measured.

Simulations of the evolution process show that if a deceased player is replaced by one with a randomly updated strategy vector, coordination is and stays bad. The probability distribution of R is almost indistinguishable from that of random vectors, and $\langle R_3 \rangle$ merely increases to $\approx 0.445$, compared to 0.43026 for random vectors. In a noisy environment, players who attempt to specialize in one of the options do not have a significant advantage.

Again, the picture changes if new players are allowed to copy (with modifications) the strategy of a randomly chosen other player. The imitation mechanism allows players to explore areas of parameter space that have a slim probability of being chosen at random, like the corners of the simplex. In fact, a strong specialization takes place: in the stationary state, each player has a decided preference for one of the rooms, leading to high average R.

The exact distribution of R depends on the details and parameters of the mutation mechanism. In Fig. 3.18, the following mode was chosen: a Gaussian random number with $\sigma = 0.005$ is added to each component, with reflective boundary conditions at 0 and 1. The resulting vector is normalized to $\sum_q p^q = 1$.

Figure 3.18: Johnson's Evolutionary Minority Game with Q = 3 options and individual preferences: if new players' strategies are picked at random (a), the stationary distribution of self-overlaps R shows no specialization. This changes if new players imitate existing players with slight modifications (b). Simulations used N = 1001.

As I have shown, the mechanism of imitation improves coordination among agents, at least for the instances discussed in the last paragraphs. However, strictly speaking, it violates the assumptions made about the rules of the game: although copying another player's strategy is not the same thing as making an explicit contract stating "Each round, I play this and that, and you play such and such", it amounts to the same thing if a player's strategy has a strong preference for one of the options.
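The random-vector baseline for the self-overlap (the values 0.43026 and 0.32812 quoted above) can be reproduced by direct sampling. The sketch assumes the generation scheme described in the text: Q uniform random numbers in [0, 1), normalized to unit sum, with the self-overlap defined as the sum of squared components. The function name `mean_self_overlap` is hypothetical.

```python
import random

def mean_self_overlap(Q, samples=100_000, seed=0):
    """Average of sum_k p_k^2 for p_k = r_k / sum_q r_q with uniform r_q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        r = [rng.random() for _ in range(Q)]
        s = sum(r)
        total += sum(v * v for v in r) / (s * s)
    return total / samples

if __name__ == "__main__":
    for Q in (3, 4, 5):
        print(Q, round(mean_self_overlap(Q), 4), round(1.31 / Q, 4))
```

In this sampling scheme the two quoted numbers correspond to Q = 3 and Q = 4, consistent with the comparison value 0.43026 used for the Q = 3 simulations; the second printed column shows the rough approximation 1.31/Q.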

Interestingly, the "contracts" made by the agents by imitation do not follow the lines of "I play +1 each time you play -1, and one of us wins for sure" as one would expect. To the contrary, players "agree" to make the same statement most of the time! That this still leads to good coordination is a consequence of the slow adaption brought about by the evolutionary process with sufficiently large thresholds and correspondingly small death rates. If the threshold is very small, such that each round a significant proportion of the population dies, one might expect something like in Reents' and Metzler's stochastic strategy to happen.

However, even if the threshold is set to -0.5, i.e., new players are replaced immediately if they make a mistake, only a small fraction of the population dies each round. This is because those players that survive their first few rounds typically accumulate $\approx 100$ points. As outlined in [73], the score of each player is a random walk with a drift that is proportional to the loss of the player. If a player is very firm in his opinion and gives the same output every time, and there is no systematic preference for either output, the mean loss of that player is 0, and the average time for that player to die is actually infinite (this is a well-known result for random walks with one absorbing edge). Since this limit is not realized completely, players do die eventually, but at a very small rate. Simulations show that the time series generated by the evolutionary MG for Q = 2 displays very weak antipersistence ($p_{ap} \approx 0.52$) for a threshold of -0.5, and significant persistence ($p_{ap} \approx 0.35$) for large negative thresholds.

3.7.5 Reents' and Metzler's stochastic strategy



The stochastic strategy for the MG described in Sec. 3.6 can be adapted to 
accommodate several options as well. Again, two generalizations suggest them- 
selves: 

1) Players are informed about which option won the last round. If they won, 
they repeat their decision; otherwise they move to the last winning room with 
probability p. 

2) Players only know if they lost or won. If they lost, they choose, with proba- 
bility p, to try one of the other options at random. 
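Both variants can be written as one update function. This is a sketch with several assumptions: rooms are labeled 0 to Q-1, "losing" means not being in the winning room, ties for the least occupied room are broken at random, and the function name `stochastic_step` and its `informed` flag are hypothetical.

```python
import random

def stochastic_step(rooms, Q, p, informed, rng):
    """One round of the stochastic strategy; rooms[i] is updated in place.

    informed=True : losers move to the last winning room with probability p.
    informed=False: losers move to a random other room with probability p.
    Returns the winning (least occupied) room of this round.
    """
    counts = [rooms.count(q) for q in range(Q)]
    least = min(counts)
    winner = rng.choice([q for q in range(Q) if counts[q] == least])
    for i, room in enumerate(rooms):
        if room != winner and rng.random() < p:
            if informed:
                rooms[i] = winner
            else:
                rooms[i] = rng.choice([q for q in range(Q) if q != room])
    return winner

if __name__ == "__main__":
    rng = random.Random(0)
    N, Q = 301, 3
    rooms = [rng.randrange(Q) for _ in range(N)]
    for _ in range(1000):
        stochastic_step(rooms, Q, p=1.0 / N, informed=False, rng=rng)
    print([rooms.count(q) for q in range(Q)])
```

Winners never move in either variant, so the only source of change is the fraction p of losing players; this is what makes the occupation numbers a one-step Markov process.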

In both cases, the system can still be considered a one-step Markov process. However, the state of the system must now be characterized by Q - 1 values $N_1, \ldots, N_{Q-1}$, which give the number of players in each room (the remaining value $N_Q$ can be calculated from the normalization constraint $\sum_q N_q = N$), and the joint probability distribution is a (Q - 1)-dimensional tensor, or a function living in $\mathbb{R}^{Q-1}$ if one wants to go to continuous variables. Transition probabilities look even worse, taking the form of 2(Q - 1)-dimensional tensors or integral kernels. Put briefly, this problem is only accessible to simulations and very crude approximations.



Large p

In the limit of $p = O(1)$, $N \to \infty$, the behavior is analogous to that detailed in Sec. 3.6.2: finite fractions of players move from room to room, and the minority option changes at every time step. Suitable variables are $n_i = N_i/N$, the fractions of players who chose option i. Occupation probabilities $\pi(n_i)$ are a superposition of Q Gaussian peaks whose widths decrease with N. Self-consistent values for the centers of the peaks can be found analytically, as the following example will show:

Let us take Q = 3, and assume that the players are not informed about which option was successful, i.e., if they lose and decide to switch, they choose one of the remaining options at random. At any given step, there are three occupation numbers, which we order $n_1 < n_2 < n_3$. Room 1 will now receive players from rooms 2 and 3, whereas rooms 2 and 3 gain players from the respective other room and lose players to all other rooms. Neglecting fluctuations, the rate equations for the occupations at the next time step look like this:

$$n_1^{+} = n_1 + (p/2)(n_2 + n_3);$$
$$n_2^{+} = n_2 - p\,n_2 + (p/2)\,n_3;$$
$$n_3^{+} = n_3 - p\,n_3 + (p/2)\,n_2. \qquad (3.73)$$

If one can find a permutation of $\{n_i\}$ such that each $n_i^{+}$ is equal to $n_j$ with some $j \neq i$, one has a solution. In the present case, the solution for Eq. (3.73) is $n_1^{+} = n_3$, $n_2^{+} = n_1$ and $n_3^{+} = n_2$ for $0 < p < 2/3$, and $n_1^{+} = n_3$, $n_2^{+} = n_2$ and $n_3^{+} = n_1$ for $2/3 < p < 1$. The corresponding equations are



Figure 3.19: Centers of the peaks of $\pi(n)$ in the stochastic MG with Q = 3. Simulations with N = 4001 agree well with Eqs. (3.74).

Fig. 3.19 shows the results of Eqs. (3.74) compared to simulations. Data 
collection from the simulations is tedious, because the centers of each peak have 
to be fitted individually for each p; however, even the few points give a good 
impression of the accuracy. 
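The self-consistency condition can be verified numerically. The sketch below assumes the rate equations $n_1^{+} = n_1 + (p/2)(n_2 + n_3)$, $n_2^{+} = (1-p)\,n_2 + (p/2)\,n_3$, $n_3^{+} = (1-p)\,n_3 + (p/2)\,n_2$ for ordered occupations $n_1 < n_2 < n_3$ with $n_1 + n_2 + n_3 = 1$. The closed-form peak centers in `peak_centers` were derived here by solving the two permutation conditions together with normalization; they are not copied from the original equations and should be treated as an independent cross-check. Both function names are hypothetical.

```python
def rate_step(n, p):
    """Apply the rate equations to the occupations, sorted so that n1 <= n2 <= n3."""
    n1, n2, n3 = sorted(n)
    return (n1 + 0.5 * p * (n2 + n3),
            (1.0 - p) * n2 + 0.5 * p * n3,
            (1.0 - p) * n3 + 0.5 * p * n2)

def peak_centers(p):
    """Self-consistent fractions (n1, n2, n3); derived for this sketch, see lead-in."""
    if p < 2.0 / 3.0:
        # cycle n1+ = n3, n2+ = n1, n3+ = n2
        d = 3.0 * (2.0 - p) ** 2
        return ((4.0 - 6.0 * p + 3.0 * p * p) / d,
                4.0 * (1.0 - p) / d,
                2.0 / (3.0 * (2.0 - p)))
    # exchange n1+ = n3, n3+ = n1, with n2 unchanged
    d = 10.0 - 3.0 * p
    return ((4.0 - 3.0 * p) / d, 2.0 / d, 4.0 / d)

if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        centers = peak_centers(p)
        print(p, centers, sorted(rate_step(centers, p)))
```

A solution is self-consistent if one step of the rate equations merely permutes the three fractions, so the sorted output of `rate_step` must agree with the input. At p = 2/3 both branches give (1/4, 1/4, 1/2), so the two regimes join continuously.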



Small p 

The case of small $p \propto 1/N$ is even less accessible to analytics. I will therefore only present results from simulations. The probability distribution for the deviation from the optimal configuration, $\pi(N_i - N/Q)$, depends less and less on the order of magnitude of N as $N \to \infty$. However, it does depend on N mod Q - this is less relevant for larger p, but it becomes clear that very close to the optimal state, it makes a difference whether the optimal state has a net gain of 0, $-1/Q$, or $-(Q-1)/Q$.

It also turns out that, if players are not informed about which option won, the probability distribution does not converge to an optimal state, regardless of how small p is chosen: even if at most one agent per time step switches sides, there is a chance of $(Q-2)/(Q-1)$ of choosing the wrong room, making matters worse.

A different mechanism prevents perfect coordination if agents are informed about the last minority option, even if N mod Q = 0, i.e., equidistribution among the rooms is possible: according to the rules outlined above, only one room is declared winner. Even if perfect coordination is achieved at some point, agents from the rooms that are tied with the winner but lost the coin toss will move to the winning room, ruining equidistribution.

The symmetry between $N_i - N/Q < 0$ and $N_i - N/Q > 0$ that was still present for Q = 2 is broken, resulting in an increasingly skewed distribution for increasing Q.

Although details are more complex, the broad features of the stochastic MG remain the same: for large p, the ensemble of players can be treated as Q effective players who switch rooms, and strong herd behavior is observed. For $p = O(1/N)$, there is good, although not perfect, coordination.

3.8 Summary and remarks on the Minority Game 

This chapter has provided an overview of different strategies for the Minority Game, with different approaches: the standard MG provides the players with elaborate strategies, but little freedom to change them. The quenched randomness inherent in the model is the source of most of the interesting behavior.

The neural networks approach presented in Sec. 3.4 also includes quenched 
randomness in the differences between weight vectors. However, this randomness 
is irrelevant for all practical purposes, and a calculation that assumes a perfectly 
symmetrical state gives good results. The idea here is to give players a fairly so- 
phisticated prediction algorithm and a learning algorithm that can, in principle, 
completely alter the parameters of this algorithm. The rate at which this mod- 
ification takes place determines whether the agents can find an anti-correlated 
state or whether they over-compensate and show herd behavior. Adaption has to 
be slower with increasing population size, and there is a smooth crossover from 
negative to positive correlations between players. 

Johnson's evolutionary MG (Sec. 3.5) models each player in a very simplistic 
way, with only one parameter, and an evolutionary learning algorithm. Neverthe- 
less, it is almost intractable analytically. The evolutionary algorithm suggested in 
the original publications induces significant self-organization. However, if newly 
generated players are allowed to copy the strategy from existing players, almost 
total coordination results (see Sec. 3.7.4).

Reents' and Metzler's stochastic strategy is also very simple. In the treatment given in Sec. 3.6, the whole population is described by a single parameter! Still, my feeling is that the suggested strategy is close to what people do in real-life situations where little information is available. It would be interesting to check this in psychological experiments.

As in the case of the neural networks, coordination becomes better as learning becomes slower, and the rate of change has to be proportional to 1/N to achieve optimal coordination. Between this regime and the regime of herding behavior (p = const), there is again a smooth crossover.

It can be shown in simulations and calculations that the stochastic strategy 
(at least in the small-p regime) is robust to the presence of noise (i.e., other 
players playing different strategies), and evolutionarily stable. 

The section on the multi-choice Minority Game showed that one can find suit- 
able generalizations of all presented strategies to scenarios where players have to 
choose between more than two different options. With the exception of neural 
networks, an analytical solution seems too involved to be a realistic option; how- 
ever, simulations show that the general behavior of the strategies is similar to 
that for Q = 2, with respect to the presence or absence of phase transitions, the 
scaling regimes for parameters, and so forth. The evolutionary MG is an excep- 
tion: the original update rule suggested by Johnson et al. yields no specialization; 
however, a strategy-copying mechanism again leads to self-organization. 

Of the presented strategies, only the standard MG shows a phase transition. This has drawn significant attention [51]; however, I attribute this more to the general fondness of statistical physicists for phase transitions than to good evidence that minority-like settings in real life have a phase transition. Admittedly, it is an established fact that time series from financial markets display power-law probability distributions typical for critical behavior [56]. With sufficient tweaking, it is possible to find a variation of the standard MG (the so-called Grand-canonical MG [83, 84]) which replicates these properties, or "stylized facts". It might be worthwhile to explore if this is possible with any of the other strategies as well.

This chapter has highlighted some connections between the Minority Game 
and the concept of anti-predictability. All presented strategies show (very pro- 
nounced in some cases, and weakly in others) a transition to antipredictability in 
the time series generated by the ensemble of players. 

If players adapt slowly, they can find a well-coordinated state where the next
minority decision is either very random (neural networks) or so narrow that even
though it can be predicted, it cannot be exploited (stochastic strategy,
evolutionary MG with imitation). If quenched randomness is present, however, even
slow adaptation is sometimes not sufficient to remove preferences for certain
outputs (standard MG).

If players adapt quickly (large learning rates in neural networks, large p in the
stochastic strategy, small initial differences in the crowded regime of the standard
MG), they overcompensate: large numbers of players change their opinions at
the same time, defeating their own prediction. The Minority Game is thus a
well-motivated framework in which antipredictability emerges naturally in some
limits, and it allows one to study the circumstances that lead to the failure of
prediction algorithms.

Chapter 4

Neural Networks in Two-Player 
Games 

The Minority Game that was introduced in Chapter 3 and related to the concept
of antipredictability is only one example out of a large number of game-theoretical
scenarios. Neural networks have been applied to some of these scenarios; however,
there seems to be little work on very basic, concrete questions concerning simple
neural networks playing simple two-player games.

This chapter will present a short overview of the different areas of game
theory, introduce some basic concepts, and then go into details on how and to
what extent a simple neural network can be used to learn strategies in two-player
games.

4.1 An overview of game theory

The field of game theory can be divided into several distinct areas. Although the
mathematical treatment of pure games of chance triggered important research
on probability theory [85, 86], game theory in the modern sense often does not
include randomness as a necessary ingredient. The usual scenario consists of
two or more players who have a choice between a set of options or strategies.^
The payoff that each player receives depends on the choices he makes and the
choices of the other player(s), which he cannot influence. The aim of each player
is to maximize his personal payoff.^ The difference between the fields lies in
the rules of the game, and in the way that the players change or don't change
during repetitions of the game:

^The usual convention in game theory is to call each possible choice a strategy. However, to
avoid confusion between strategies, pure strategies, and mixed strategies, I will continue to call
them options, and use the term strategy for the set of probabilities of choosing each option.

^Newcomers to game theory often feel that this approach oversimplifies the motivations and
emotions of players. Game theorists usually answer that any motivation can, in principle, be
quantified and accounted for by modifying the payoffs. Whether this is indeed the case can
be argued; however, to allow for an analytical approach, a quantification of preferences seems
necessary.

4.1.1 Combinatorial game theory 

In combinatorial game theory, players take alternating turns and usually have
full information about the state of the game and all possible future combinations
of moves. A simple example of this type of game is Nim: the game starts with
several heaps of matches, and each player can take an arbitrary number (at least
one) of matches from one of the heaps. The last player to take a match wins. Many
combinatorial games can be reduced to Nim and studied using a generalization
of rational numbers called Nimbers (see "Winning Ways" by Berlekamp et al.
[87] for details on Nim, an introduction to Nimbers, and lots of information on
other combinatorial games).
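
The analysis of Nim can be illustrated with a short sketch. It relies on the standard result that, when the player who takes the last match wins, a position is lost for the player to move exactly when the bitwise XOR of all heap sizes (the "nim-sum") is zero. The function names are my own and purely illustrative.

```python
from functools import reduce
from operator import xor


def nim_sum(heaps):
    """Bitwise XOR of all heap sizes (the 'nim-sum')."""
    return reduce(xor, heaps, 0)


def winning_move(heaps):
    """Return (heap_index, new_heap_size) for a winning move, or None.

    Assumes normal play: whoever takes the last match wins.
    """
    total = nim_sum(heaps)
    if total == 0:
        return None  # any move hands the opponent a nonzero nim-sum
    for index, size in enumerate(heaps):
        target = size ^ total
        if target < size:
            # Reducing this heap to `target` makes the nim-sum zero.
            return index, target
    return None  # not reached when the nim-sum is nonzero


if __name__ == "__main__":
    print(winning_move([3, 4, 5]))
    print(winning_move([1, 2, 3]))
```

A return value of `None` means that the player to move loses against optimal play.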

More complex cases of combinatorial games are Four-in-a-Row, Dots-and-Boxes,
Go and Chess. Although it is in principle possible to tell in advance how
the game will end if all players play optimally and which moves are optimal in
a given situation, one can argue that all "interesting" real-life games require too
much computational effort for a human player to go through the complete analysis
(in fact, some games have been proven to be NP-complete, i.e., the computational
effort of all known algorithms increases exponentially with the size of the board
[88]). A case in point is Tic-Tac-Toe, which is simple enough to grasp completely
and considered boring by most people beyond the age of 10. On the other hand, a
complete analysis of chess is beyond the capacities of even the most advanced
computers.

Thus, players of such games rely on heuristics (or experience and "intuition",
which amounts to the same thing) to evaluate the quality of moves in games like
Chess [89]. If moves seem equivalent to players, they pick among them more
or less randomly, which allows for stochastic treatments of games that are, in
principle, deterministic (see e.g. Ref. [90]).

4.1.2 Economic game theory 

In economic game theory [91], uncertainty is an essential ingredient, since players
make their choices simultaneously, and none of the players know what choices the
other(s) will make. The simplest scenario considered in economic game theory
is that of two-player zero-sum games [12]. Here, each of the two players has a
finite number of options. Each combination of options is assigned a number (an
entry in the payoff matrix C) which the first player has to pay to the second if
that combination is chosen, such that the gain of one player is always the loss of
the other. A popular example is Rock-Paper-Scissors. Zero-sum games always
have a unique combination of strategies for the two players where the strategy
of each player is optimal if the respective other player does not deviate from his
strategy. Such a combination is called a Nash equilibrium. More details on
zero-sum games, as well as a formal explanation of Nash equilibria, will follow in
Sec. 4.2.

In contrast, non-zero-sum games can have different payoff tables for each 
player. This makes the situation more complicated, allowing for cases in which 
there are several equilibria or in which the solution that maximizes "global good" 
(the sum of average gains) is not stable because options exist that increase one 
player's payoff, but decrease the sum of payoffs. The most infamous scenario is 
the Prisoner's Dilemma [92]. 

Some non-zero-sum games involve a large number of players and a fairly simple 
payoff scheme. For example, the Minority Game described in Chap. 3 is a 
negative-sum game with an arbitrarily large number of participants. 

Games are also classified as complete information games if all players know
all entries of all payoff matrices, and as incomplete information games if each
participant only knows part of the payoffs (usually at least his own).

4.1.3 Evolutionary game theory 

This branch may be considered a spin-off of economic game theory which is
concerned with the mechanisms by which equilibria can be reached [77, 93]. Typical
models include populations of players with individual strategies that repeatedly
play against each other. Successful players generate offspring with the same or
at least similar strategies, whereas less successful strategies can become extinct.

Evolutionary game theory has become a testing ground for simple models of
interpersonal interactions, allowing one to study under which circumstances
cooperation among players can exist, when cheating is to be expected, and what
ways of behavior favor which response. As the name suggests, it has become
extremely important in evolutionary biology, where it is used to model animal
behavior and has shed light on many aspects of mating behavior, cooperation
within herds, and so forth.

Interestingly, many popular games do not easily fit into the categories of either
theory. For example, card games like Poker, Rommé or Magic: The Gathering®
and board games like Monopoly® or Siedler von Catan® mix combinatorial
elements with a strong stochastic component (rolling the dice, drawing cards)
and/or incomplete information (what cards are the other players holding?). In
principle, they can be formulated as matrix games if the realization of randomness
is included as the choice of a "dummy player", and picking an option actually
means deciding, in advance, what move to make for each conceivable situation
during the course of the game. The number of different options is usually
exponential in the number of cards and/or die rolls, and the formulation as a matrix
game serves purposes of theoretical analysis rather than providing practical
guidelines on how to play.

4.2 Zero-sum games



Zero-sum games, in which the gain of one player always corresponds to an
equivalent loss of the other player, can be treated with a fairly elegant
mathematical apparatus introduced by John von Neumann [12]. Let us call one
player A and the other B. In each round, A chooses one among K options. In the
following analysis, the option that is actually chosen will be denoted by
$1 \le i \le K$. At the same time, B picks an option j from his repertoire of L
options. They then consult the entry $c_{ij}$ of the payoff matrix C, and A
receives $c_{ij}$ from B. Accordingly, A will try to play strategies where large
$c_{ij}$ are picked, and B will try to keep $c_{ij}$ small (negative, if possible).

In some realizations of C, it may be optimal for A to keep choosing one
option which is better than the alternatives, no matter what B plays, as in this
example:

\[ C = \begin{pmatrix} 10 & 2 & 5 \\ 3 & -2 & 1 \\ -5 & 1 & 4 \end{pmatrix}. \tag{4.1} \]

Here, all entries in row 1 are higher than the corresponding entries in rows 2 and
3 - row 1 is said to strictly dominate rows 2 and 3, and picking it is a no-brainer,
so to speak. Player A is said to play a pure strategy if he chooses the same option
all the time. If B knows that A is going to repeatedly pick row 1, all that is left
to do is minimize the losses and pick column 2 every time as well.
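
The dominance argument can be checked mechanically. The following sketch uses the entries of the example matrix as I read them from Eq. (4.1); the helper names are illustrative, and indices are zero-based, so row 1 and column 2 of the text appear as 0 and 1.

```python
def strictly_dominates(matrix, row, other):
    """True if every entry of `row` is larger than the matching entry of `other`."""
    return all(x > y for x, y in zip(matrix[row], matrix[other]))


def dominant_rows(matrix):
    """Rows that strictly dominate all other rows (at most one can exist)."""
    rows = range(len(matrix))
    return [
        r for r in rows
        if all(strictly_dominates(matrix, r, o) for o in rows if o != r)
    ]


# Payoffs that A receives from B; rows are A's options, columns are B's.
C = [[10, 2, 5],
     [3, -2, 1],
     [-5, 1, 4]]

best_row = dominant_rows(C)[0]
# B's best reply to a known pure strategy: the column with the smallest payment.
best_column = min(range(len(C[best_row])), key=lambda j: C[best_row][j])

if __name__ == "__main__":
    print(best_row, best_column)
```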

For random payoff matrices, the probability that one pure strategy dominates
all other strategies decreases exponentially with increasing K. Even for small
payoff matrices, the interesting cases are the ones without dominating strategies,
such as Penny-Matching:

In this game, player A tries to play the same option as player B, who in turn 
tries to play the opposite option of player A. The only way for a player to avoid 
being exploited is to choose at random between the two options - in this case, 
with 50% probability for each. 

Generally, the behavior of players A and B at a given time can be defined
by the vectors a and b of probabilities $a_l$ and $b_m$ for playing options l and m,
respectively. These vectors follow the normalization constraints $\sum_l a_l = 1$
and $\sum_m b_m = 1$, i.e., they lie on the K- and L-dimensional simplex. If a
player is playing a pure strategy, his strategy vector only has one nonvanishing
component.

The expected payoff $\lambda_k^A$ for A of one of his strategies k depends only on B's
strategy:

\[ \lambda_k^A = \sum_l b_l c_{kl}. \tag{4.3} \]

Player A will prefer strategies with higher $\lambda_k^A$. In the language of game theory,
a strictly rational agent will always prefer the option with the highest $\lambda_k^A$, thus
playing a pure strategy every time, unless the expected payoffs of several
strategies are exactly tied. This almost never happens in models where the player
estimates the opponent's strategy from experience, rather than calculating the
opponent's equilibrium strategy and assuming he will play it [15]. A more tolerant
approach, and one that emerges automatically in the neural network learning
model discussed later, is to give higher probabilities to options with higher payoffs,
and gradually eliminate those that promise significantly lower payoffs completely.
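
As a small illustration of the expected payoffs and of the "strictly rational" choice, the sketch below evaluates the payoff of each of A's options against a mixed strategy of B. The 2x2 penny-matching matrix with entries +1 and -1 is an assumption made for this example (A wins if the choices match), and the function names are hypothetical.

```python
def expected_payoffs(C, b):
    """Expected payoff of each row option of A against B's mixed strategy b."""
    return [sum(c * p for c, p in zip(row, b)) for row in C]


def best_responses(C, b, tol=1e-12):
    """Options a strictly rational player A would consider (ties included)."""
    payoffs = expected_payoffs(C, b)
    top = max(payoffs)
    return [k for k, value in enumerate(payoffs) if top - value <= tol]


# Assumed penny-matching payoffs: A receives +1 on a match, -1 otherwise.
PENNIES = [[1, -1],
           [-1, 1]]

if __name__ == "__main__":
    # Against a 50/50 opponent both options are tied, so A cannot exploit B.
    print(expected_payoffs(PENNIES, [0.5, 0.5]), best_responses(PENNIES, [0.5, 0.5]))
    # Any bias of B singles out one pure strategy for a rational A.
    print(best_responses(PENNIES, [0.7, 0.3]))
```

The tie at 50/50 is the exceptional case mentioned in the text: as soon as B's estimated strategy deviates slightly, a strictly rational player switches to a pure strategy.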



4.2.1 Nash equilibria 

A Nash equilibrium is a combination of strategies a* and b* such that no player
can improve his average payoff by unilaterally deviating from his equilibrium
strategy. The average payoff for player A is

\[ \lambda^A = \sum_{i,j} a_i c_{ij} b_j. \tag{4.4} \]

He will try to maximize the payoff he receives if B makes life as difficult as possible
for him, and will choose

\[ \max_{a} \min_{b} \sum_{i,j} a_i c_{ij} b_j. \tag{4.5} \]

On the other hand, player B, who is trying to minimize the amount he pays to
A, will choose

\[ \min_{b} \max_{a} \sum_{i,j} a_i c_{ij} b_j. \tag{4.6} \]

The Nash equilibrium fulfills both conditions at the same time. As von Neumann
proved [12], in zero-sum games, there is exactly one Nash equilibrium. If it is a
mixed strategy, all options that are played with non-zero probability (the support
of the strategy) yield the exact same average payoff, also called the value of the
game $\nu$. The necessary and sufficient conditions for a* and b* can be written in
a very concise way:

\[ \lambda_i^A = \sum_j c_{ij} b_j^* \le \nu \quad \text{for all } i; \tag{4.7} \]

\[ \lambda_j^B = \sum_i a_i^* c_{ij} \ge \nu \quad \text{for all } j. \tag{4.8} \]

The minimax theorem guarantees that the smallest value for which a solution to
Eq. (4.7) can be found is the largest value that yields a solution to Eq. (4.8). For
a given payoff matrix, the equilibrium strategies can be calculated using linear
optimization. For details, consult Refs. [94] and [95].
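
The equilibrium conditions can be verified directly for a candidate pair of strategies. The sketch below does not solve the linear optimization problem; it only checks the two inequalities (no option of A pays more than the value, no option of B costs less), with the value taken as the average payoff of the candidate pair. The function name and the Rock-Paper-Scissors matrix (A receives +1 for a win, -1 for a loss) are assumptions for illustration.

```python
def check_equilibrium(C, a, b, tol=1e-9):
    """Check the minimax conditions for strategies a (rows) and b (columns).

    Returns (is_equilibrium, value), where value is the average payoff to A.
    """
    rows, cols = range(len(C)), range(len(C[0]))
    payoff_a = [sum(C[i][j] * b[j] for j in cols) for i in rows]
    payoff_b = [sum(a[i] * C[i][j] for i in rows) for j in cols]
    value = sum(a[i] * payoff_a[i] for i in rows)
    no_better_row = all(x <= value + tol for x in payoff_a)
    no_better_column = all(x >= value - tol for x in payoff_b)
    return no_better_row and no_better_column, value


# Rows/columns: rock, paper, scissors; entries are what A receives from B.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
UNIFORM = [1 / 3, 1 / 3, 1 / 3]

if __name__ == "__main__":
    print(check_equilibrium(RPS, UNIFORM, UNIFORM))
    # Always playing rock can be exploited by paper, so this pair fails.
    print(check_equilibrium(RPS, [1, 0, 0], UNIFORM))
```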

The interpretation of the Nash equilibrium is not very intuitive and worth
spending a few words on. If player A plays the "optimal" strategy a*, he is
guaranteed the average payoff $\nu$. There is no way for B to exploit A. However,
there are usually several options for B that are precisely equivalent - each one
makes B pay $\nu$ to A on average. In that sense, it is not a big challenge to play
against a player who sticks to a* and does not respond to his opponent's actions
- B has to find one of the strategies in the support of b* and play it.^ Doing this
opens up player B to exploitation from A, which in turn would allow B to exploit
A. On the other hand, it means that a learning algorithm of player B cannot be
expected to converge to the minimax-optimal solution b* if A always plays a*.

In a game with a finite number of options, a small deviation from a* or b*
only causes a small difference in the expected payoffs $\lambda$, which can only be
empirically detected through many repetitions of the game. A pair of learning
algorithms that rely only on observing the opponent's actions will therefore
converge to equilibrium very slowly (if they converge at all).

4.3 Multi-Choice Perceptrons 

Economic game theory often assumes perfectly rational players with unlimited
computational resources and a perfect grasp of the situation: both players study
the game thoroughly before they start playing, find the Nash equilibrium and
play the corresponding strategy, thus fulfilling the expectations of the opponent,
who likewise plays his optimal strategy. Since this rarely applies to real-world
situations, the question of learning in games has received much attention (see
Ref. [96] and references therein): how can a player who is not familiar with the
behavior of the opponent find a strategy, based on his knowledge of the payoff
matrix and the observed behavior of the opponent?

Even in learning scenarios, there is a concept of a rational player: it means a 
player who, based on some prediction algorithm, calculates the expected payoff for 
each of his options and always chooses the one that promises the highest payoff. 
However, it was shown in Ref. [15] that under rather general circumstances, a 
rational player has no chance of learning to play the correct equilibrium strategy. 

It is therefore interesting to search for strategies that are not rational, but
that will converge to the Nash equilibrium, at least under some circumstances.
One obvious candidate are neural networks, whose most interesting property is
the ability to learn from examples. If each player has only two choices, a simple
perceptron could be used to make the decision. This has been studied in some
detail in Ref. [66] for some simple games like "Matching Pennies". That case was
also examined in Ref. [30]. However, even in the case of K = 2, a more complex
network may be appropriate: a simple perceptron with random unbiased patterns
will pick each option with 50% probability - this is generally not the optimal
mixed strategy. If a larger number of options is available, a different architecture
is needed in any case. The obvious choice is the Potts perceptron, also known as
Multi-Class perceptron [81].

^This can be readily verified using the corresponding program in Ref. [94].

A simplified version of the Potts perceptron was introduced in Section 3.7.2;
here, we need the full architecture. This is still the simplest neural network
that can handle inputs and outputs of the required form - quite possibly, more
elaborate networks could give better results; however, their behavior becomes
mathematically less tractable.

To be able to choose one of K output options, the network of player A consists
of K hidden units that generate K hidden fields $h_i$.

The output $\sigma$ is usually taken to be the option with the highest hidden field:

\[ \sigma = \{ i \mid h_i = \max_j (h_j) \}. \tag{4.9} \]

This rule is the reason why this architecture is also called "Winner-takes-all
perceptron" (WTAP, for the sake of brevity). For reasons explained below, it
can be helpful to take as output the option with the highest absolute value of the
hidden field:

\[ \sigma = \{ i \mid |h_i| = \max_j |h_j| \}. \tag{4.10} \]

For real-valued input vectors x, each hidden unit is a linear perceptron:

\[ h_i = \mathbf{w}_i \cdot \mathbf{x}. \tag{4.11} \]
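
The two output rules are easy to state in code. The following is a minimal sketch for real-valued inputs; the function names and the `use_absolute` flag are my own, and option indices are zero-based.

```python
def hidden_fields(weights, x):
    """One linear hidden field per option: the dot product of w_i and x."""
    return [sum(w_n * x_n for w_n, x_n in zip(w, x)) for w in weights]


def wtap_output(weights, x, use_absolute=False):
    """Winner-takes-all output: index of the largest (absolute) hidden field."""
    fields = hidden_fields(weights, x)
    score = abs if use_absolute else (lambda h: h)
    return max(range(len(fields)), key=lambda i: score(fields[i]))


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
    x = [1.0, 1.0]
    print(hidden_fields(weights, x))
    print(wtap_output(weights, x), wtap_output(weights, x, use_absolute=True))
```

In this example the third unit has the most negative field, so it loses under the plain rule but wins under the absolute-value rule.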

If the input vectors consist of integers between 1 and Q (for example, a time series
of the opponent's decisions), a more elaborate scheme is needed: each hidden unit
has a set of Q weight vectors $\mathbf{w}_i^q$ with entries $w_{i,n}^q$. The hidden field for
output i is

\[ h_i = \sum_{q=1}^{Q} \sum_{n=1}^{N} w_{i,n}^q \left( Q\,\delta_{x_n,q} - 1 \right). \tag{4.12} \]

This can be written as

\[ h_i = \sum_{q=1}^{Q} h_i^q \quad \text{with} \quad h_i^q = \mathbf{w}_i^q \cdot \mathbf{x}^q, \tag{4.13} \]

where the $\mathbf{x}^q$ are vectors with components $x_n^q = Q\,\delta_{x_n,q} - 1$. Thus each of the
possible input values generates a separate contribution to the total hidden field.
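
The encoding of integer inputs can be sketched as follows. The sketch assumes that the components of the auxiliary vectors are Q - 1 where the input equals q and -1 elsewhere, which is how I read the definition given with Eq. (4.13); the function names are illustrative.

```python
def encode(inputs, Q):
    """Return Q vectors; vector q has Q - 1 where the input equals q, else -1."""
    return [
        [Q - 1 if value == q else -1 for value in inputs]
        for q in range(1, Q + 1)
    ]


def hidden_field(weight_sets, inputs):
    """Hidden field of one unit: sum over q of the dot product w^q . x^q."""
    Q = len(weight_sets)
    return sum(
        sum(w * x for w, x in zip(w_q, x_q))
        for w_q, x_q in zip(weight_sets, encode(inputs, Q))
    )


if __name__ == "__main__":
    history = [1, 3, 2]          # opponent's last three decisions, Q = 3
    print(encode(history, 3))
    # One weight vector per possible input value.
    weight_sets = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
    print(hidden_field(weight_sets, history))
```

With this encoding each input component sums to zero over q, so a value that never occurs contributes only the constant -1 terms.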

In the following sections, I will assume real-valued inputs, which are easier
to treat and to imagine. As usual, the variance of one component of the input
is taken to be $\langle x_n^2 \rangle = 1$. Later on, the learning rules will be generalized to
multi-value inputs.

What can and should we expect the neural network to do? If the inputs are
completely random, the network should at least be able to use these patterns as
seeds to generate a probability distribution of outputs and to adapt this
distribution such that it gradually improves the perceptron's average payoff. If the
inputs indeed contain some information on the opponent's action, the network
might be able to capitalize on this. However, if the opponent is also a network of
similar capabilities, there is no a priori reason why one should be able to outplay
the other.

It turns out that learning has to follow different principles depending on 
whether the input pattern has a bias (a preferred direction) or whether it is 
an isotropic random vector. Note that it is not important whether this bias car- 
ries any information about the opponent or not. If the aim were to create a good, 
applicable learning algorithm, one could artificially add a bias vector. However, 
my first aim is to see what happens for a rather naive approach to learning in 
different scenarios. 

4.4 Learning from unbiased patterns 

4.4.1 Generating a probability distribution 

As mentioned in Section 4.3, one of the feats that a neural network in a two-player 
game should be capable of is to generate any probability distribution of outputs 
when fed with random patterns. The first step is therefore to check whether this 
is feasible. 

If the weight vectors of the network's hidden units are mutually perpendicular
($\mathbf{w}_i \cdot \mathbf{w}_j = w_i^2 \delta_{ij}$), the hidden fields are uncorrelated Gaussian random numbers
with variances $\langle h_i^2 \rangle = w_i^2$.^ The probability $a_i$ that option i is chosen is then

\[ a_i = \int_{-\infty}^{\infty} \frac{dh_i}{\sqrt{2\pi}\, w_i} \exp\left( -\frac{h_i^2}{2 w_i^2} \right) \prod_{j \neq i} \int_{-\infty}^{h_i} \frac{dh_j}{\sqrt{2\pi}\, w_j} \exp\left( -\frac{h_j^2}{2 w_j^2} \right). \tag{4.14} \]

It is clear that under these circumstances, no probability larger than 1/2 can be
achieved: even if all other weights are set to 0 (and correspondingly their hidden
fields are 0), there is only a 50% chance that the hidden field of the nonzero
vector is larger than 0.

It is also clear that no probability can be smaller than $2^{-K+1}$: if one weight is
set to 0 while all others have a finite value, there is still a chance of $2^{-K+1}$ that
all others are negative, and thus the field with a value of 0 wins. The function
(4.14) interpolates between these two extremes.

^The assumption of perpendicular vectors simplifies things immensely. Of course, it must
be justified later, when learning algorithms are considered.

A small variation of the decision rule removes these limitations: if the output
is determined by the largest absolute field, as suggested in Eq. (4.10), $a_i$ is given
by

\[ a_i = \int_0^{\infty} \frac{2\,dh_i}{\sqrt{2\pi}\, w_i} \exp\left( -\frac{h_i^2}{2 w_i^2} \right) \prod_{j \neq i} \int_0^{h_i} \frac{2\,dh_j}{\sqrt{2\pi}\, w_j} \exp\left( -\frac{h_j^2}{2 w_j^2} \right) = \int_0^{\infty} \frac{2\,dh_i}{\sqrt{2\pi}\, w_i} \exp\left( -\frac{h_i^2}{2 w_i^2} \right) \prod_{j \neq i} E\left( \frac{h_i}{w_j} \right), \quad E(x) = \int_0^{x} \frac{2\,dh}{\sqrt{2\pi}} \exp(-h^2/2). \tag{4.15} \]

This rule allows for probabilities $a_i$ between 0 and 1 - however, the interpretation
of Eq. (4.10) is admittedly dubious if the pattern has any meaning beyond being
a random number seed: the invariance of Eq. (4.10) to the sign of h essentially
states, "If the situation is such-and-such, do this, and if the situation is exactly
the opposite, do the same."

In the simplest case of K = 2, Equation (4.15) can be evaluated analytically,
and we obtain

\[ a_1 = (2/\pi) \arctan(w_1/w_2) \quad \text{and} \quad a_2 = 1 - a_1. \tag{4.16} \]
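
The arctan formula for two options can be checked with a quick Monte Carlo experiment. The sketch assumes what the text assumes, namely two independent Gaussian hidden fields with standard deviations equal to the weight norms and a decision by the larger absolute field; sample size and seed are arbitrary choices of mine.

```python
import math
import random


def predicted_a1(w1, w2):
    """Probability of option 1 for the absolute-field rule with K = 2."""
    return (2.0 / math.pi) * math.atan(w1 / w2)


def estimate_a1(w1, w2, samples=100_000, seed=0):
    """Monte Carlo estimate: fraction of draws in which |h1| exceeds |h2|."""
    rng = random.Random(seed)
    wins = sum(
        abs(rng.gauss(0.0, w1)) > abs(rng.gauss(0.0, w2))
        for _ in range(samples)
    )
    return wins / samples


if __name__ == "__main__":
    for w1, w2 in [(1.0, 1.0), (1.0, 2.0), (3.0, 1.0)]:
        print(w1, w2, predicted_a1(w1, w2), estimate_a1(w1, w2))
```

The two columns should agree up to statistical fluctuations of order one over the square root of the sample size.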



4.4.2 Learning rule 

If we assume that the patterns are random and the strategy is only determined
by the set of weight norms $\{w_i\}$, we need a learning rule that adapts the norms
accordingly. The following rule employs the mechanisms of the Confused Bit
Generator: if a perceptron learns the opposite of its own output, its norm
converges to a finite value. If it reinforces its output, the norm grows linearly with
$\alpha$. A plausible first attempt at a learning rule is therefore

\[ \mathbf{w}_i^{t+1} = \mathbf{w}_i^t + \frac{\eta}{M}\, \mathbf{x}\, \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_i^t)\, c_{ij}. \tag{4.17} \]

The row player updates each of its weights using the column j of the payoff matrix
that his opponent chose. Weights with a high payoff get positive feedback, while
weights with a negative payoff are suppressed. Averaging over many outputs j
and patterns x, Eq. (4.17) becomes

\[ \langle \mathbf{w}_i^{t+1} - \mathbf{w}_i^t \rangle_{j,\mathbf{x}} = \frac{\eta}{M} \langle \mathbf{x}\, \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_i) \rangle_{\mathbf{x}} \sum_j b_j c_{ij}. \tag{4.18} \]

That means that the feedback term on the right hand side is proportional to
the expected payoff $\lambda_i^A = \sum_j b_j c_{ij}$ for that option. A positive value of $\lambda_i^A$ leads
to linear growth, a negative value leads to a finite value, and a vanishing value
$\lambda_i^A = 0$ gives $w_i \propto \sqrt{\alpha}$.

This may be problematic since the value of the game $\nu$ may be different from
0, leading to a suppression of all options for $\nu < 0$ or enhancement of non-optimal
options for $\nu > 0$. A better learning rule incorporates this and estimates
the value of the game by subtracting the current payoff $c_{\sigma j}$ (the payoff of the
option $\sigma$ that was actually played) from the potential payoff of the updated
option, $c_{ij}$:

\[ \mathbf{w}_i^{t+1} = \mathbf{w}_i^t + \frac{\eta}{M}\, \mathbf{x}\, \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_i^t)\, (c_{ij} - c_{\sigma j}). \tag{4.19} \]

This rule is somewhat more difficult to treat, since the update for each option
explicitly depends on what option was chosen by the player, i.e., it depends on
all other $w_i$ as well. For large K, one can ignore this effect, but for small K
it turns out to be important. Thus I will first look closer at the textbook case
of K = L = 2 for specific matrices (called generalized penny-matching in the
literature) and then go to the limit of large K for random matrix entries.
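
A single step of the improved learning rule can be sketched as follows. The sketch assumes a step size of eta/M and a feedback equal to the payoff the updated option would have earned minus the payoff actually received; both are my reading of Eq. (4.19), and all names are illustrative.

```python
def sign(value):
    return 1.0 if value >= 0 else -1.0


def wtap_update(weights, x, chosen, column, C, eta):
    """Update all weight vectors of the row player after one round.

    chosen: option played by the row player, column: option played by the
    opponent. The weight of the chosen option gets zero feedback.
    """
    M = len(x)
    new_weights = []
    for i, w in enumerate(weights):
        field = sum(w_n * x_n for w_n, x_n in zip(w, x))
        feedback = C[i][column] - C[chosen][column]
        step = eta / M * sign(field) * feedback
        # Positive feedback reinforces the current field, negative weakens it.
        new_weights.append([w_n + step * x_n for w_n, x_n in zip(w, x)])
    return new_weights


if __name__ == "__main__":
    C = [[1, 0], [1, 1]]
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(wtap_update(weights, x=[1.0, -1.0], chosen=0, column=1, C=C, eta=1.0))
```

In the example the player chose option 1 and received 0, while option 2 would have paid 1, so only the second weight vector changes, and its norm grows.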



4.4.3 The case of K = 2 

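As mentioned above, for small K, the output of the WTAP has to follow the
largest absolute value of the inner field $|h_i| = |\mathbf{w}_i \cdot \mathbf{x}|$ - otherwise both options
would be chosen with 50% probability each. Using absolute fields, the strategy
for two options is given by Eq. (4.16). In the spirit of on-line learning, we can
take the square of Eq. (4.19) and use the limit $M \to \infty$ to transform the update
equations into a set of differential equations for the $w_i$. For example, for option
1 of player A, one obtains (with $\mathbf{x} \cdot \mathbf{x} = M$):

\[ (w_1^{t+1})^2 = (w_1^t)^2 + \frac{2\eta}{M} |h_1| (c_{1j} - c_{\sigma j}) + \frac{\eta^2}{M} (c_{1j} - c_{\sigma j})^2. \tag{4.20} \]

The second and third terms on the right hand side vanish if $|h_1| > |h_2|$, which
has to be taken into account when averaging:

\[ \langle |h_1| (c_{1j} - c_{\sigma j}) \rangle = \sqrt{\frac{2}{\pi}}\, w_1 \left( 1 - \frac{w_1}{\sqrt{w_1^2 + w_2^2}} \right) (\lambda_1^A - \lambda_2^A), \tag{4.21} \]

\[ \langle (c_{1j} - c_{\sigma j})^2 \rangle = a_2 \left( b_1 (c_{11} - c_{21})^2 + b_2 (c_{12} - c_{22})^2 \right). \tag{4.22} \]

In this special case, $\lambda_i^A$ is an abbreviation for $b_1 c_{i1} + b_2 c_{i2}$. Using the chain rule,
a differential equation for $w_1$ can be derived:

\[ \frac{dw_1}{d\alpha} = \eta \sqrt{\frac{2}{\pi}} \left( 1 - \frac{w_1}{\sqrt{w_1^2 + w_2^2}} \right) (\lambda_1^A - \lambda_2^A) + \eta^2 a_2 \left[ b_1 (c_{11} - c_{21})^2 + b_2 (c_{12} - c_{22})^2 \right] / (2 w_1). \tag{4.23} \]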

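A similar equation can be derived for $w_2$. If player B is using a neural network
with the same learning rule (whose vectors I will denote as $\mathbf{v}_j$ with norms $v_j$ to
avoid confusion), the differential equations for its weight vector norms $v_1$ and $v_2$
can be written down as well:

\[ \frac{dw_2}{d\alpha} = \eta \sqrt{\frac{2}{\pi}} \left( 1 - \frac{w_2}{\sqrt{w_1^2 + w_2^2}} \right) (\lambda_2^A - \lambda_1^A) + \eta^2 a_1 \left[ b_1 (c_{11} - c_{21})^2 + b_2 (c_{12} - c_{22})^2 \right] / (2 w_2). \tag{4.24} \]

\[ \frac{dv_1}{d\alpha} = \eta \sqrt{\frac{2}{\pi}} \left( 1 - \frac{v_1}{\sqrt{v_1^2 + v_2^2}} \right) (\lambda_2^B - \lambda_1^B) + \eta^2 b_2 \left[ a_1 (c_{11} - c_{12})^2 + a_2 (c_{21} - c_{22})^2 \right] / (2 v_1). \tag{4.25} \]

\[ \frac{dv_2}{d\alpha} = \eta \sqrt{\frac{2}{\pi}} \left( 1 - \frac{v_2}{\sqrt{v_1^2 + v_2^2}} \right) (\lambda_1^B - \lambda_2^B) + \eta^2 b_1 \left[ a_1 (c_{11} - c_{12})^2 + a_2 (c_{21} - c_{22})^2 \right] / (2 v_2). \tag{4.26} \]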

The symbol $\lambda_j^B$ stands for the expected payoffs of B's options, $a_1 c_{1j} + a_2 c_{2j}$, and
player B favors strategies j with minimal $\lambda_j^B$. This set of ODEs can now be solved
numerically for any 2x2 payoff matrix. In simple cases, the asymptotics can be
calculated analytically. For example, take the payoff matrix

\[ C = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}. \tag{4.27} \]

In this case, $\lambda_1^A = b_1$, $\lambda_2^A = 1$, $\lambda_1^B = 1$, and $\lambda_2^B = a_2$. One can assume that
for large $\alpha$, $w_1 \ll w_2$ and $v_1 \ll v_2$, and correspondingly make the expansions
$a_1 \approx (2/\pi)\, w_1/w_2$, $b_1 \approx (2/\pi)\, v_1/v_2$, $w_2/\sqrt{w_1^2 + w_2^2} \approx 1 - (w_1/w_2)^2/2$, and
$v_2/\sqrt{v_1^2 + v_2^2} \approx 1 - (v_1/v_2)^2/2$. A short calculation leads to the asymptotic value
for $w_1$ and $v_1$: $w_1, v_1 \to \sqrt{\pi/8}\,\eta$; for $w_2$ one gets $w_2 \propto \alpha^{1/3}$, corresponding to
$a_1 \propto \alpha^{-1/3}$. With this result, the exponent for $v_2$ and $b_1$ can be calculated; it is
$v_2 \propto \alpha^{2/9}$ and $b_1 \propto \alpha^{-2/9}$. The complete numerical solution of Eqs. (4.23-4.26)
agrees excellently with simulations and also confirms the calculated asymptotic
exponents, as seen in Fig. 4.1.
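
The numerical solution mentioned here can be sketched with a plain Euler integration. The right-hand sides below follow my reading of Eqs. (4.23-4.26): a drift term proportional to the payoff difference and a fluctuation term proportional to eta squared. The example matrix with rows (1, 0) and (1, 1), the initial norms, the step size, and all function names are assumptions made for this illustration.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def strategy(n1, n2):
    """Probabilities of options 1 and 2 from two weight norms (arctan formula)."""
    p1 = (2.0 / math.pi) * math.atan(n1 / n2)
    return p1, 1.0 - p1


def derivatives(w, v, C, eta):
    """Time derivatives of the norms w (player A) and v (player B)."""
    a, b = strategy(*w), strategy(*v)
    lam_a = [b[0] * C[i][0] + b[1] * C[i][1] for i in range(2)]
    lam_b = [a[0] * C[0][j] + a[1] * C[1][j] for j in range(2)]
    var_a = b[0] * (C[0][0] - C[1][0]) ** 2 + b[1] * (C[0][1] - C[1][1]) ** 2
    var_b = a[0] * (C[0][0] - C[0][1]) ** 2 + a[1] * (C[1][0] - C[1][1]) ** 2
    rw, rv = math.hypot(*w), math.hypot(*v)
    dw, dv = [], []
    for k in range(2):
        o = 1 - k  # index of the other option
        dw.append(eta * SQRT_2_OVER_PI * (1 - w[k] / rw) * (lam_a[k] - lam_a[o])
                  + eta ** 2 * a[o] * var_a / (2 * w[k]))
        # Player B minimizes, so the payoff difference enters with opposite sign.
        dv.append(eta * SQRT_2_OVER_PI * (1 - v[k] / rv) * (lam_b[o] - lam_b[k])
                  + eta ** 2 * b[o] * var_b / (2 * v[k]))
    return dw, dv


def integrate(C, eta=1.0, alpha_max=100.0, step=0.01):
    """Euler integration starting from equal norms (an arbitrary choice)."""
    w, v = [1.0, 1.0], [1.0, 1.0]
    for _ in range(round(alpha_max / step)):
        dw, dv = derivatives(w, v, C, eta)
        w = [x + step * d for x, d in zip(w, dw)]
        v = [x + step * d for x, d in zip(v, dv)]
    return w, v


if __name__ == "__main__":
    w, v = integrate([[1, 0], [1, 1]])
    print("a =", strategy(*w), "b =", strategy(*v))
    print("w1 =", w[0], "compare sqrt(pi/8) =", math.sqrt(math.pi / 8))
```

For this matrix both players drift towards their second option, and the norm of A's first weight vector settles near a constant rather than growing. A higher-order integrator would be preferable for quantitative work; Euler is used only to keep the sketch short.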

It turns out that the minimax-optimal solution for this problem is a* = (0, 1),
b* = (1, 0); i.e., b converges to the wrong solution. However, if a = a*, the payoff
is indifferent to the choice of b, and as long as a ≠ a*, b = (0, 1) is the best
response of player B.

In this example, the proposed learning algorithm converges to a strategy that
gives the same payoff as the equilibrium strategy, but the rate of convergence
is slow (power-law convergence with small negative exponents) and depends on
the realization of the matrix. A look at Eqs. (4.23-4.26) shows that the Nash
equilibrium (which is characterized by $\lambda_1^A = \lambda_2^A$ and $\lambda_1^B = \lambda_2^B$) is a stationary
point of the update equations only in the long-time limit where $1/w_i$ and $1/v_j$ are
negligible. This is better than no convergence at all, but can hardly be considered
good in a situation where a few rounds should be enough to clarify what is the
best thing to do. However, a learning algorithm, even a slow one, may be more
useful in situations where the optimal behavior is not as obvious, such as for large
payoff matrices.

Figure 4.1: Development of strategies in a zero-sum game with payoffs according
to Eq. (4.27), under learning rule Eq. (4.19). Symbols in the large plot denote
simulations with M = 2000.

4.4.4 Large K 

Let us now turn to the case of large payoff matrices. To make statements that
are valid for fairly generic cases, I will also assume that the entries are random
and uncorrelated - to be specific, they are random Gaussian numbers of variance
$\langle c_{ij}^2 \rangle = K$. The statistics of the minimax-optimal solution for this scenario were
studied by J. Berg and A. Engel [97, 98, 99]. Combining these results with
simulations and calculations that I did, the following picture emerges:

If player A picks a random strategy $a_r$, the expected payoffs of player B's
options are again Gaussian random numbers with zero mean and a variance of
roughly 1 (the exact value depends on the self-overlap of the random strategy
vector).

If both players are playing the minimax-optimal strategies a* and b*, and
K = L, on average 50% of each player's options are in the support of this strategy,
i.e., they are played with non-zero probability. Each of these yields an average
payoff of $\nu$, which is 0 for $K \to \infty$ and a Gaussian random number of mean 0
and variance $\langle \nu^2 \rangle \propto 1/K$ for finite K. The strategies which are not played give
a payoff that is lower than $\nu$ and follows a Gaussian probability distribution.

In order to see how a WTAP that learns according to Eq. (4.19) behaves, it
is necessary to assume that the entries in the payoff matrix are not correlated to
the strategies of the players. This may seem counter-intuitive at first: after all,
player A ought to prefer strategies with above-average entries. However, if player
B can avoid being exploited completely (by a learning algorithm or other clever
strategy), player A will have to put up with taking strategies with below-average
entries as well, at least for large K where dominating strategies are exceedingly
unlikely.

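The mentioned assumption leads to $\langle c_{ij}^2 \rangle = \langle c_{\sigma j}^2 \rangle = K$ and $\langle c_{ij} c_{\sigma j} \rangle = K a_i$,
where $\sigma$ denotes the option that was actually played.
It is easy to calculate that $\langle \mathbf{x} \cdot \mathbf{w}_i\, \mathrm{sign}(\mathbf{x} \cdot \mathbf{w}_i) \rangle = \sqrt{2/\pi}\, w_i$ and to combine these
results into

\[ \frac{dw_i^2}{d\alpha} = 2 \eta \sqrt{\frac{2}{\pi}}\, w_i \left( \lambda_i^A - \sum_l a_l \lambda_l^A \right) + 2 \eta^2 K (1 - a_i). \tag{4.28} \]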

The second term on the right hand side is a random walk term. The factor 
(1 — ai) can be explained by a look at Eq. (4.19): the weight that belongs to 
the currently chosen option is not updated. More important, however, is the first 
term, which causes the weights with above-average success to grow and those 
with below-average success to shrink. This term is proportional to wi, so it will 
dominate for large a and correspondingly large Wi. 
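
The factor $(1 - a_i)$ in the random walk term can be made explicit with a short calculation. It is a sketch under the stated assumptions only: the payoff entries have variance K and are uncorrelated with the players' strategies, and $\sigma$ (my notation) denotes the option that was actually played.

```latex
\begin{align*}
\langle (c_{ij} - c_{\sigma j})^2 \rangle
  &= \langle c_{ij}^2 \rangle
     - 2 \langle c_{ij} c_{\sigma j} \rangle
     + \langle c_{\sigma j}^2 \rangle \\
  &= K - 2 K a_i + K \\
  &= 2 K (1 - a_i).
\end{align*}
```

The cross term is nonzero only in the rounds where option i itself is played, which happens with probability $a_i$; in those rounds the feedback vanishes and the weight is not updated.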

If both players play their optimal strategies a* and b*, $\sum_l a_l \lambda_l^A$ is equal to
the value of the game $\nu$, and the first term in Eq. (4.28) goes to 0 for those
options i that are part of the optimal strategy, and stays negative for those that
are not. Neglecting the second term, this would be an indication that the optimal
strategies are a fixed point of the considered learning rule.

To be sure that the assumption of perpendicular vectors underlying Eq. (4.14)
holds, we must examine the time development, and hence the fixed point, of angles
$\theta_{lm} = \arccos(\mathbf{w}_l\cdot\mathbf{w}_m/(w_l w_m))$ between the vectors. A calculation similar to the
one above yields for the unnormalized overlap $R_{lm} = \mathbf{w}_l\cdot\mathbf{w}_m$:

$$\frac{dR_{lm}}{d\alpha} = \eta\sqrt{\frac{2}{\pi}}\cos(\theta_{lm})\Big[w_l\Big(\lambda_m - \sum_i a_i\lambda_i\Big) + w_m\Big(\lambda_l - \sum_i a_i\lambda_i\Big)\Big] + \eta^2\Big(1 - \frac{2\theta_{lm}}{\pi}\Big)K(1 - a_l - a_m). \tag{4.29}$$

Using the chain rule, an expression for the development of $\cos(\theta_{lm})$ can be derived.
Interestingly, all terms proportional to $\eta$ vanish:

$$\frac{d\cos(\theta_{lm})}{d\alpha} = \eta^2 K\Big[\Big(1 - \frac{2\theta_{lm}}{\pi}\Big)\frac{1 - a_l - a_m}{w_l w_m} - \frac{\cos(\theta_{lm})}{w_l^2}(1 - a_l) - \frac{\cos(\theta_{lm})}{w_m^2}(1 - a_m)\Big]. \tag{4.30}$$

Of the remaining terms, the ones proportional to $\cos(\theta_{lm})$ dominate: the prefactor
$1/(w_l w_m)$ is smaller or equal to $1/w_l^2 + 1/w_m^2$, $1 - a_l - a_m$ is smaller or equal to
$1 - a_l$ and $1 - a_m$, and near the fixed point $\theta = \pi/2 + \Delta\theta$, an expansion gives
$1 - 2\theta/\pi \approx -2\Delta\theta/\pi$, whereas $\cos(\pi/2 + \Delta\theta) \approx -\Delta\theta$. Hence, for small $\Delta\theta$ the
right hand side is proportional to $-\Delta\theta$, and $\theta = \pi/2$ is a stable fixed point. This is
reassuring and tells us that the use of Eq. (4.14) is at least a good approximation.
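
The Gaussian averages that enter this calculation can be checked by sampling. The sketch below assumes that the hidden fields of two weight vectors with angle $\theta$ are jointly Gaussian with correlation $\cos\theta$; the angle, norms and sample size are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                      # angle between the two weight vectors (radians)
w_l, w_m = 1.5, 0.7              # norms of the two weight vectors
n = 400_000                      # number of random patterns

# Correlated Gaussian fields standing in for x.w_l and x.w_m.
g1, g2 = rng.standard_normal(n), rng.standard_normal(n)
h_l = w_l * g1
h_m = w_m * (np.cos(theta) * g1 + np.sin(theta) * g2)

abs_field = np.mean(np.abs(h_l))                     # compare: sqrt(2/pi) * w_l
cross = np.mean(h_m * np.sign(h_l))                  # compare: sqrt(2/pi) * w_m * cos(theta)
sign_corr = np.mean(np.sign(h_l) * np.sign(h_m))     # compare: 1 - 2*theta/pi
```

The three sample averages correspond to the factor $\sqrt{2/\pi}\,w_l$ in the norm equation, the factor $\cos(\theta_{lm})$ in the drift of the overlap, and the factor $1 - 2\theta_{lm}/\pi$ in its noise term.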

The approximation can be tested in simulations, which show that the cosines of
the angles between different weight vectors fluctuate around 0, with a variance
that depends on M. Therefore, Eq. (4.14) agrees well ($\sim 1\%$ accuracy for M = 100)
with the output probabilities measured in the simulations. This makes it possible
to calculate the mixed strategy, and hence the learning success, at any point in
the simulations with reasonable computational effort.

Playing against an opponent with a fixed strategy 

The first test case is playing a simplified matrix game against an opponent who
follows a pre-determined strategy, i.e. chooses columns $j$ with constant
probabilities $b_j$.

If this fixed strategy is chosen at random, the distribution of expected average
payoffs $\lambda_l$ will be Gaussian with a variance near 1, as mentioned above. Without
loss of generality, we can rename options such that $\lambda_1 > \lambda_2 > \dots > \lambda_K$.¹ The
best strategy for player A is then $a_l = \delta_{l,1}$. A WTAP that uses Eq. (4.10) and
is updated according to Eq. (4.19) should converge to the optimal solution: The
weight belonging to the most profitable option always grows at least like $\sqrt{\alpha}$:

$$\frac{dw_1^2}{d\alpha} = 2\eta\sqrt{\frac{2}{\pi}}\Big(\lambda_1 - \sum_j a_j\lambda_j\Big)w_1 + 2\eta^2K(1 - a_1) \ge 2\eta^2K(1 - a_1). \tag{4.31}$$

Pure strategies that give below-average payoff (compared to the current
average, i.e., to $\sum_j a_j\lambda_j$) have a negative growth term in their update equation, i.e.,
their weight stays bounded, and their probability of being played goes to zero,
removing them from the competition. Gradually, the average payoff increases,
eliminating one non-optimal choice after the other. This is what happens in
simulations as well, as seen in Fig. 4.2.

As pointed out previously, if B is constantly playing the optimal strategy b*,
the best thing that A can do is to eliminate the strategies with $\lambda_l < \nu$ that are
not in the support of a*. As Fig. 4.3 shows, the neural network succeeds in
doing this: after a learning time of $\alpha = 100$, the suboptimal options are strongly
suppressed (only the option with the payoff closest to $\nu$ still has an appreciable
probability), whereas the strategies in the support of a* are played with random
positive probabilities.

¹The case that two expected payoffs are exactly equal occurs with probability 0.

Figure 4.2: Playing against a non-adaptive opponent with a random strategy: the
weight of the most advantageous strategy 1 grows fastest, whereas the weights of
below-average strategies shrink (Plot (a)). Correspondingly, the most profitable
strategy is used increasingly (Plot (b)). Simulations used a random matrix with
K = L = 5 and M = 100.

Playing against adaptive opponents - Fictitious Play 

One of the standard learning algorithms in game theory is Fictitious Play [96]:
the adaptive player (in this case, player B) keeps a histogram of his opponent's
decisions $i^{t'}$ up to a time $t$, which he uses to estimate A's strategy a by $\hat{\mathbf{a}}$:

$$\hat{a}_k = \frac{1}{t}\sum_{t'=1}^{t}\delta_{i^{t'},k}. \tag{4.32}$$

Player B then estimates the average payoffs of his available strategies assuming
that A will continue playing as he did before:

$$\hat{\lambda}_j^B = \sum_k c_{kj}\hat{a}_k, \tag{4.33}$$

and chooses the output with minimal $\hat{\lambda}_j^B$. This updating rule has been proven
to converge to the optimal strategy if both players use it, in the sense that the
empirical estimate $\hat{\mathbf{a}}$ converges to a*, and $\hat{\mathbf{b}}$ to b*, as $t \to \infty$ [96].
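
As an illustration of Eqs. (4.32) and (4.33), here is a minimal sketch of two Fictitious Play players facing each other. The function name `fictitious_play`, the choice of matching pennies as the payoff matrix and the arbitrary first moves are assumptions made for this example; unnormalized histograms are used because the factor $1/t$ does not change which option is best.

```python
import numpy as np

def fictitious_play(C, steps):
    """Both players run Fictitious Play on payoff matrix C (payoff to A)."""
    K, L = C.shape
    counts_a = np.zeros(K)       # histogram of A's past moves (used by B)
    counts_b = np.zeros(L)       # histogram of B's past moves (used by A)
    i, j = 0, 0                  # arbitrary first moves (an assumption)
    for _ in range(steps):
        counts_a[i] += 1
        counts_b[j] += 1
        # A maximizes and B minimizes the payoff expected under the histograms.
        i = int(np.argmax(C @ counts_b))
        j = int(np.argmin(counts_a @ C))
    return counts_a / steps, counts_b / steps

C = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies, a* = b* = (1/2, 1/2)
a_hat, b_hat = fictitious_play(C, 20_000)
```

The returned histograms approach the equilibrium strategies, although each individual move is a deterministic function of the history.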

However, in a different sense, it does not converge: at any given time, the
output of the "Fictitious Play" player is deterministic, and he often repeats the
same output for several time steps, so he never plays the optimal mixed strategy.

Simulations show that in the long run, when Fictitious Play and the proposed
neural network algorithm play against each other, they converge to the Nash
equilibrium in the long-time average sense. At any given moment, the output
distribution of A is significantly different from a*. However, just as the histogram
of outputs of two players doing Fictitious Play converges to a* and b*, so does
the output of the Neural Network player. Fig. 4.4 shows the components of $\hat{\mathbf{a}}$
and $\hat{\mathbf{b}}$ (the empirical long-time averages) versus the components of a* and b*
after 10® steps in a simulation with K = 20. Perfect convergence to equilibrium
would result in all points lying on the diagonal. The figure also shows the current
strategy $\mathbf{a}$ of the Neural Network at the moment when the simulation ended. One
can see that the deviation of the current $\mathbf{a}$ (empty squares) from the equilibrium
strategy is larger than that of $\hat{\mathbf{a}}$ (empty circles), which agrees reasonably well
with a*.

Figure 4.3: Components $a_l$ of the strategy vector versus expected payoff $\lambda_l$,
against an opponent using the equilibrium strategy b*. The figure shows the
state after $10^4$ learning steps with K = 20, M = 100.

The fluctuations of $\mathbf{a}$ around the optimal strategy can be quantified by
introducing an overlap $R = \mathbf{a}\cdot\mathbf{a}^*/(|\mathbf{a}||\mathbf{a}^*|)$ and an expected payoff
$\lambda^A = \langle\sum_l a_l\lambda_l^A\rangle$ averaged over times short compared to the total learning
process. $R$ is observed to fluctuate around high values $\sim 0.9$, but never converges
to 1, whereas $\lambda^A$ fluctuates around the value of the game, as seen in Fig. 4.5. The
figure also shows the overlap of $\hat{\mathbf{a}}$ with a* (dotted line), which seems to converge
to 1. However, the rate of convergence depends on the realization of the payoff
matrix.

Figure 4.4: Averaged empirical and current strategies $\hat{\mathbf{a}}$ and $\mathbf{a}$ of a neural
network playing against Fictitious Play (empty symbols) and empirical strategy $\hat{\mathbf{b}}$
of the opponent (filled symbols) after 10^ learning steps with K = 20, M = 100,
compared to the equilibrium strategies a* and b*. Perfect agreement with the
equilibrium strategies would result in all points lying on the diagonal.

Figure 4.5: Dynamics of the overlap R between $\mathbf{a}$ and a* (solid line) and between
$\hat{\mathbf{a}}$ and a* (dotted line) of a neural network playing against Fictitious Play. The
inset shows the payoff for A, averaged over 1000 rounds of play. The simulation
used M = 100 and K = 20.

Playing against an equivalent opponent 

One rather obvious question is how the proposed learning algorithm fares against
an opponent who is using the same architecture and the same update rule. The
answer is: quite well, with some exceptions. Generally, the principles of the
algorithm work fine: options that give above-average payoffs are strengthened,
the others are suppressed. If equilibrium is reached, the weight vectors continue
to grow slowly due to the random walk term in Eq. (4.28), which gradually slows
down the dynamics. Simulations show that the strategies of the two players
usually reach a high overlap with the Nash equilibrium. Fluctuations around the
optimal strategy decrease with increasing M and decreasing $\eta$.

The mentioned exceptions concern two possible problems: for one, there is 
again the possibility that an option that is not in the support of the optimal 
strategy has a payoff that is only negligibly below the value of the game. This 
option is very hard to tell apart from a "good" one, i.e., one that is in the support, 
by looking at average payoffs, and may stay in the strategy mix for a long time. 
From the point of view of optimization, such an option corresponds to a very 
shallow valley whose minimum has to be found. 

The second problem can occur if the networks of both players are fed with
the same pattern. If two weight vectors of the two networks have an angle other
than $\pi/2$, the output of the players is mutually correlated. Although a calculation
similar to that leading to Eq. (4.30) shows that the overlaps between the different
weight vectors of the two players, $\mathbf{w}_l\cdot\mathbf{v}_k/(w_l v_k)$, should have a stable fixed point at
0, simulations show that a significant overlap (on the order of 0.9) can develop. If
one network can predict the output of the other beyond the point of knowing the
probability distribution a, the concept of an equilibrium mixed strategy breaks
down. It appears, however, that this irregularity occurs less frequently with
increasing M.

Figure 4.6: Two neural networks with M = 1000 playing a zero-sum game:
the plot shows how the overlaps with the equilibrium strategy increase during
learning, then saturate at some value that depends on M and $\eta$.

A remark on efficiency and more general schemes

The proposed learning algorithm essentially uses the patterns as random number
seeds, without drawing any further information from them. The only relevant
property of the weight vectors was their norm. If the task is to find an algorithm
that learns to play a two-player game, would it not be more efficient to reduce
the problem from updating and multiplying vectors to rolling Gaussian random
numbers and updating their variance according to Eq. (4.28)? Indeed it would.
This abstraction also opens the way to all sorts of fiddling with the learning rule:
the noise term (the term proportional to $\eta^2$ in Eq. (4.28)) can be made larger
or smaller, the learning rate can be made time-dependent, the variance can be
increased linearly, polynomially or exponentially... the possibilities are limitless.

This opens the door to a more general class of learning algorithms for
two-player games: adjusting the variances of random numbers, the largest of which
determines the output, is one way of updating the probability vector. However,
it is not trivial to find one that even comes close to the Nash equilibrium.
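
The reduction described in this remark can be sketched in a few lines. The code below is an illustration under assumptions, not the implementation used for the simulations: it takes the per-step change of $w_l^2$ that follows from an update of the form $\mathbf{w}_l \to \mathbf{w}_l + (\eta/M)\,\mathbf{x}\,\mathrm{sign}(h_l)(c_{lj} - c_{ij})$ with $|\mathbf{x}|^2 \approx M$, and the function name `play_reduced`, the small payoff matrix and the parameter values are invented for the example.

```python
import numpy as np

def play_reduced(C, b, eta=0.5, M=100, steps=5000, seed=0):
    """Gaussian-variance version of the learning rule; returns the final widths w."""
    rng = np.random.default_rng(seed)
    K, L = C.shape
    w = np.ones(K)                                   # widths replacing the weight norms
    for _ in range(steps):
        h = w * rng.standard_normal(K)               # stand-in for the hidden fields
        i = int(np.argmax(np.abs(h)))                # A plays the option with largest |h|
        j = int(rng.choice(L, p=b))                  # B plays a fixed mixed strategy
        d = C[:, j] - C[i, j]                        # payoff differences c_lj - c_ij
        # Change of w_l^2: drift 2(eta/M)|h_l| d_l plus noise (eta^2/M) d_l^2.
        w2 = w**2 + 2.0 * (eta / M) * np.abs(h) * d + (eta**2 / M) * d**2
        w = np.sqrt(np.maximum(w2, 1e-12))           # guard against negative variance
    return w

# Hypothetical 3x2 game in which option 0 is best against every column.
C = np.array([[2.0, 1.0], [0.5, -0.5], [-1.0, -2.0]])
w_final = play_reduced(C, b=[0.5, 0.5])
```

In this toy game the width of the dominant option can only grow, so it ends up being played most of the time, in line with the behavior described for a fixed opponent.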

A remark on time scales 

The analysis of the learning algorithm tacitly assumed the usual limit $M \to \infty$,
which allows us to replace $\langle c_{ij}\rangle$ with $\sum_j b_j c_{ij}$ - as long as K is finite, each option
$j$ that is played at all is played an infinite number of times during a short time
interval $\Delta\alpha$. In reality, M is always finite - and as the remark above showed, M
is not really relevant as the dimensionality of a vector space. However, a certain
number of rounds have to be played to get a good estimate of the expected payoff
of an option. Let us assume that strategies are fixed for a moment. If a large
number $t$ of rounds are played, player A played option $i$ $t(a_i + \Delta_i)$ times, where
$\Delta_i$ is a Gaussian random number of mean 0 and variance $\langle\Delta_i^2\rangle = a_i(1 - a_i)/t$.
The mean square deviation between the observed and the expected payoff for one
of B's options $j$ is then

$$\Big\langle\Big(\sum_i c_{ij}\Delta_i\Big)^2\Big\rangle \approx \frac{K}{t}. \tag{4.34}$$

To decide which one of two options with payoffs $\lambda_j^B$ and $\lambda_k^B$ is better, player B
has to wait roughly $t > K/(\lambda_j^B - \lambda_k^B)^2$ time steps.

Learning in two-player games is, as is often the case in learning scenarios, a
tradeoff between accuracy (exact judgment which option is the best) and speed
(to avoid being exploited by insisting on a suboptimal strategy for a long time).
Things become even trickier near equilibrium, since the difference in expected
payoffs decreases the closer the opponent comes to his optimal strategy, and
vanishes if he reaches it. At that point, any drift term in a learning algorithm
vanishes. Since the noise term remains, fluctuations are inevitable.

An optimized learning algorithm might take this into account and adapt
quickly as long as differences in the expected payoffs are large, while it reduces
both the rate of adaptation and, if possible, additional noise terms when it is near
an equilibrium. However, it is not obvious how proximity to equilibrium can be
detected unambiguously.
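
The $1/t$ scaling behind the remark on time scales can be checked by sampling. In this sketch the strategy `a`, the payoff column `c_col` and the helper names are made-up illustrative values; the code only checks that the mean square error of the payoff estimate drops by a factor of four when four times as many rounds are observed, and it evaluates the rough waiting time $t > K/(\lambda_j^B - \lambda_k^B)^2$ quoted in the text.

```python
import numpy as np

def payoff_estimate_mse(c_col, a, t, trials, rng):
    """Mean square error of B's payoff estimate for one option after t rounds."""
    counts = rng.multinomial(t, a, size=trials)      # how often A played each option
    estimates = counts @ c_col / t                   # observed average payoffs
    return np.mean((estimates - a @ c_col) ** 2)

def rounds_needed(K, payoff_gap):
    """Rough waiting time t > K / gap^2 to tell two options apart."""
    return K / payoff_gap**2

rng = np.random.default_rng(1)
a = np.array([0.5, 0.3, 0.2])                        # fixed strategy of A (made up)
c_col = np.array([1.0, -2.0, 0.5])                   # payoffs c_ij for one option j of B
mse_short = payoff_estimate_mse(c_col, a, 100, 20_000, rng)
mse_long = payoff_estimate_mse(c_col, a, 400, 20_000, rng)
t_min = rounds_needed(K=20, payoff_gap=0.1)          # about 2000 rounds
```

With K = 20 and a payoff gap of 0.1, roughly 2000 rounds are needed, which illustrates why options with payoffs close to the value of the game are so hard to tell apart.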



Learning from time series 

So far, the patterns that the neural networks used had no meaning in themselves 
and contained no information on the opponent's actions. This could be considered 
the worst case, and it is good to know that the networks responded reasonably 
if the update rule was chosen accordingly. If the patterns now represent real 
information - e.g., they contain the time series of the opponent's output - should 
not the networks perform much better? 

To check this, we have to leave the simplicity of real-valued patterns behind
and use Q-state patterns, as described in Sec. 4.3: the components of A's patterns
can take $Q^A = L$ values, whereas B's patterns consist of $Q^B = K$ values. The
first update rule to try is the generalization of Eq. (4.19):

$$\mathbf{w}_l^{q,t+1} = \mathbf{w}_l^{q,t} + \frac{\eta}{M}\mathbf{x}^q\,\mathrm{sign}(h_l)(c_{lj} - c_{ij}). \tag{4.35}$$

The output is determined by the hidden field with the highest absolute value,
and the patterns that each network sees are the time series of the other player's
output.

The outcome is disillusioning: the output probabilities have no similarity with
the optimal strategy. A closer look reveals that both perceptrons lock into long
stretches of repeated outputs, similar to a single Bit Generator with fixed weights.
Since weights are not exactly fixed in this case, cycles do not continue forever, but
nevertheless, no convergence to an equilibrium is observed.

Could the network possibly draw useful information from the patterns and at
the same time avoid falling into cycles if the components of the patterns had no
temporal correlation? For example, one can feed the networks an artificial time
series of an opponent playing his optimal strategy with true random numbers.
The results are again surprising: while the strategy of the networks at any given
time is not very close to equilibrium, the long-time average of outputs converges
to equilibrium fairly quickly, similar to two players playing Fictitious Play against
each other.

One more detail shows that some new mechanisms are at work: if the output
is determined by the largest hidden field $h_l$ instead of $|h_l|$, there is no convergence
to equilibrium in any sense. That points to the fact that the sign of $h_l$ is no longer
random, which has its reason in an inherent bias in the patterns.

4.5 Learning from biased patterns 

A bias in the patterns can easily lead to a bias in the weights (which are, after
all, a linear combination of the initial weights and the patterns that are shown
during the learning process), which can in turn cause a bias in the hidden fields.
To understand these effects systematically, it is easier to go back to real-valued
patterns with a bias, and later make the generalization of the terms introduced
there to multi-valued inputs.

To be specific, a biased pattern vector consists of an unbiased random vector
$\tilde{\mathbf{x}}$ and a constant vector $\mathbf{y}$ of norm $y$. The weight vectors can also be split into
components parallel and perpendicular to the bias:

$$\bar{\mathbf{w}}_l = \frac{\mathbf{w}_l\cdot\mathbf{y}}{y^2}\,\mathbf{y}; \qquad \tilde{\mathbf{w}}_l = \mathbf{w}_l - \bar{\mathbf{w}}_l. \tag{4.36}$$

The appropriate order parameters are $\bar{w}_l = \mathbf{w}_l\cdot\mathbf{y}/y$, which can take positive and
negative values, and $\tilde{w}_l = \sqrt{\tilde{\mathbf{w}}_l\cdot\tilde{\mathbf{w}}_l}$, which is positive or zero.

For any pattern $\mathbf{x}$, the hidden field $h_l$ of strategy $l$ now has three components:

• a stochastic component $\tilde{h}_l = \tilde{\mathbf{w}}_l\cdot\tilde{\mathbf{x}}$ that is assumed to be independent from
the $\tilde{h}_l$ of the other options $l$;

• a constant component $\bar{\mathbf{w}}_l\cdot\mathbf{y} = \bar{w}_l y$;

• and a stochastic component $\bar{\mathbf{w}}_l\cdot\tilde{\mathbf{x}} = h_c\bar{w}_l$ where the factor $h_c$ is a Gaussian
variable of variance 1 that takes the same values for all strategies $l$ for a
given $\mathbf{x}$.

The chance that $h_l$ for a given $l$ is the largest now reads (analogous to 4.14):

$$a_l = \int Dh_c\int D\tilde{h}_l\prod_{m\neq l}\Phi\Big(\frac{\tilde{w}_l\tilde{h}_l + (h_c + y)(\bar{w}_l - \bar{w}_m)}{\tilde{w}_m}\Big). \tag{4.37}$$

The integrals can possibly be eliminated by linear transformation of the variables;
however, the transformed expression neither becomes prettier nor gives more
insight.

If the bias dominates the output, a second look at the update rule is necessary.
If Eq. (4.19) is used and the weight of an option has a large negative bias $\bar{w}_l$, the
hidden field will be negative, and the update will tend to make it more negative
if the option is favorable. This is not a problem if the output follows the maximal
$|h_l|$; however, it is now also possible to let the output follow the maximal $h_l$ (Eq.
(4.9)) and drop the $\mathrm{sign}(h_l)$ term in the update rule:

$$\mathbf{w}_l^{t+1} = \mathbf{w}_l^t + \frac{\eta}{M}\mathbf{x}\,c_{lj}. \tag{4.38}$$

The normalizing term proportional to $-c_{ij}$ is not needed anymore either - it
would result in adding the same update vector to all $\mathbf{w}_l$, thus shifting all hidden
fields by the same amount for a given pattern vector. This no longer has an
influence on the decision of the network.

Figure 4.7: Learning from biased patterns: while the long-time averaged strategy
converges to a* quickly for large bias y (as in Fictitious Play), the current strategy
comes closest to a* for intermediate y.



The development of the two order parameters $\tilde{w}_l$ and $\bar{w}_l = \bar{h}_l/y$ is
straightforward:

^ = (4-39) 

$$\frac{d\bar{w}_l}{d\alpha} = \eta y\sum_j b_j c_{lj} = \eta y\lambda_l. \tag{4.40}$$

Interestingly, rule (4.38) implements a "soft" fictitious play rule. This can
be seen in the limit $y \to \infty$, where the stochastic component can be neglected
completely. At time $t$, the strategy $i^*$ with the largest $\bar{w}_l^t$ is played, while the
opponent plays $j^*$; $\bar{w}_l^t$ at time $t$ can be written as

$$\bar{w}_l^t = \frac{\eta y}{M}\sum_{t'=1}^{t}c_{l j^{t'}} = \frac{\eta y\,t}{M}\sum_j\hat{b}_j c_{lj} = \frac{\eta y\,t}{M}\hat{\lambda}_l^A, \tag{4.41}$$

where $\hat{b}_j$ and $\hat{\lambda}_l^A$ are analogous to Section 4.4.4. By adjusting the range of $y$
between 0 and $\infty$, it is possible to interpolate between random guessing and
purely deterministic fictitious play. If $y$ takes intermediate values, there is enough
stochasticity to get, at any given time, a mixed strategy that is reasonably close
to a*; however, the convergence to a* in the long-time average is slower than for
larger bias $y$. This is shown in Fig. 4.7.
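
The interpolation between guessing and deterministic play can be illustrated with the three field components listed earlier in this section. The sketch below is an assumption-laden toy: `choose_option` is a hypothetical helper, the accumulated biases in `w_bar` are invented numbers, and the perpendicular norms are simply set to 1.

```python
import numpy as np

def choose_option(w_bar, w_tilde, y, rng):
    """Pick the option with the largest hidden field for one biased pattern."""
    h_c = rng.standard_normal()                      # shared fluctuation along the bias
    h_tilde = rng.standard_normal(len(w_bar))        # independent perpendicular parts
    h = w_tilde * h_tilde + w_bar * (y + h_c)        # sum of the three components
    return int(np.argmax(h))

rng = np.random.default_rng(2)
w_bar = np.array([0.2, 0.1, -0.3])                   # made-up accumulated biases
w_tilde = np.ones(3)                                 # perpendicular norms set to 1
picks_large_y = [choose_option(w_bar, w_tilde, 1e6, rng) for _ in range(1000)]
picks_zero_y = [choose_option(w_bar, w_tilde, 0.0, rng) for _ in range(1000)]
```

For a very large bias the option with the largest accumulated $\bar{w}_l$ is chosen every time, as in Fictitious Play, whereas without bias all options keep being played.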

Multi-state patterns 

The same ideas can be generalized to Q-state input patterns and weight vectors.
It will become clear how unequal probabilities of the Q inputs amount to a bias.

As mentioned, each output $l$ now has a set of $Q$ weight vectors $\mathbf{w}_l^q$ with entries
$w_{l,\nu}^q$. The hidden field for output $l$ is

$$h_l = \sum_{q=1}^{Q}\sum_{\nu=1}^{N}w_{l,\nu}^q\,(Q\delta_{x_\nu,q} - 1) = \sum_{q=1}^{Q}\mathbf{w}_l^q\cdot\mathbf{x}^q = \sum_{q=1}^{Q}h_l^q, \tag{4.42}$$

where the $\mathbf{x}^q$ are vectors with components $Q\delta_{x_\nu,q} - 1$. One can then assume that
the bias in the pattern is the same for all of its components (this is necessarily
the case with time series patterns, since each value is shifted through the entire
weight vector) and is thus completely defined by the probabilities of encountering
the possible inputs $q$. We then use the deviations $\delta_q$ from the uniform distribution
to describe this distribution:

$$\mathrm{prob}(x_\nu = q) = \frac{1}{Q}(1 + \delta_q). \tag{4.43}$$

Note that $\sum_q\delta_q = 0$. We then proceed by splitting $\mathbf{x}^q$ into an average and a
random component:

$$\mathbf{y}^q = \begin{pmatrix}\delta_q\\ \vdots\\ \delta_q\end{pmatrix}; \qquad \tilde{\mathbf{x}}^q = \mathbf{x}^q - \mathbf{y}^q. \tag{4.44}$$

The hidden field $h_l^q$ can be split into three parts analogous to the case of
real-valued input patterns:

$$h_l^q = \bar{h}_l^q + \bar{w}_l^q h_c^q + \tilde{h}_l^q, \tag{4.45}$$

where $\bar{h}_l^q = \mathbf{w}_l^q\cdot\mathbf{y}^q = \bar{w}_l^q y^q$ is the constant part, $h_c^q = \bar{\mathbf{w}}_l^q\cdot\tilde{\mathbf{x}}^q/\bar{w}_l^q$ is a random
variable that is the same for all $l$, and $\tilde{h}_l^q = \tilde{\mathbf{w}}_l^q\cdot\tilde{\mathbf{x}}^q$ is random and different for
each $l$ (the $\tilde{h}_l$ are uncorrelated if the $\tilde{\mathbf{w}}_l$ are pairwise orthogonal, which I assume
for simplicity's sake).

The variance $\langle (h_c^q)^2\rangle$ of $h_c^q$ can be calculated in a straightforward manner; the
result is

$$\langle (h_c^q)^2\rangle = Q\big(1 - (1 + \delta_q)/Q\big)(1 + \delta_q). \tag{4.46}$$

Similarly, the variance of $\tilde{h}_l^q$ is

$$\langle (\tilde{h}_l^q)^2\rangle = (\tilde{w}_l^q)^2\,Q\big(1 - (1 + \delta_q)/Q\big)(1 + \delta_q). \tag{4.47}$$
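
The variance in Eq. (4.46) is that of a single pattern component $Q\delta_{x_\nu,q} - 1$, which $h_c^q$ inherits because it is a normalized projection of independent components. A small exact check, with invented deviations `delta` and a hypothetical helper `component_moments`, could look like this:

```python
import numpy as np

def component_moments(Q, delta, q):
    """Exact mean and variance of Q*[x == q] - 1 when prob(x = q) = (1 + delta_q)/Q."""
    p = (1.0 + delta[q]) / Q                         # probability of seeing input q
    values = np.array([Q - 1.0, -1.0])               # component if x == q / otherwise
    probs = np.array([p, 1.0 - p])
    mean = probs @ values
    var = probs @ (values - mean) ** 2
    return mean, var

Q = 4
delta = np.array([0.4, -0.1, -0.1, -0.2])            # made-up deviations, sum to zero
mean, var = component_moments(Q, delta, q=0)
formula = Q * (1.0 - (1.0 + delta[0]) / Q) * (1.0 + delta[0])   # Eq. (4.46)
```

The mean of the component reproduces the bias entry $\delta_q$ of Eq. (4.44), and the variance agrees with the closed form.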

The addition of the fields $h_l^q$ belonging to different $q$ is almost, but not quite
straightforward. The total field can be written like this:

$$h_l = \bar{h}_l + z_l h_{ct} + \tilde{h}_l. \tag{4.48}$$

$\bar{h}_l$ is simply the sum of the constant components $\bar{h}_l^q$. The second term is based
on the assumption that $\bar{w}_l^q$ is approximately proportional to $y^q$ for all $q$, with a
coefficient characteristic for $l$: $\bar{w}_l^q = z_l y^q$. This assumption will be justified later.
The sum $\sum_q\bar{w}_l^q h_c^q$ can then be written as $z_l h_{ct}$, where $h_{ct}$ is a Gaussian variable
with a variance of $\langle h_{ct}^2\rangle = \sum_q (y^q)^2\,Q\big(1 - (1 + \delta_q)/Q\big)(1 + \delta_q)$. Again, $h_{ct}$ is the
same for all $l$.

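The components of the strategy vector a can now be calculated in analogy to
Eq. (4.37):

$$a_l = \int_{-\infty}^{\infty}Dh_{ct}\int_{-\infty}^{\infty}D\tilde{h}_l\prod_{m\neq l}\Phi\Big(\frac{\tilde{w}_l\tilde{h}_l + (y^2 + h_{ct})(z_l - z_m)}{\tilde{w}_m}\Big). \tag{4.49}$$

Three questions remain: what is $z_l$? What is all this splitting of the hidden fields
good for? And what relevance does this have for time-series patterns?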

The first question can be answered by a look at the learning rule, which, in
analogy to Eqs. (4.35) and (4.38), can be taken as

$$\mathbf{w}_l^{q,t+1} = \mathbf{w}_l^{q,t} + \frac{\eta}{M}\mathbf{x}^q c_{lj}, \tag{4.50}$$

where $j$ is the opponent's action at time $t$. The average update of the weight
component parallel to the bias is then

$$\langle\bar{w}_l^{q,t+1}\rangle = \bar{w}_l^{q,t} + \Big\langle\frac{\eta}{M}y^q c_{lj}\Big\rangle. \tag{4.51}$$

The order parameter then follows the differential equation

$$\frac{d\bar{w}_l^q}{d\alpha} = \eta y^q\sum_j b_j c_{lj} = \eta y^q\lambda_l, \tag{4.52}$$

which can be solved:

$$\bar{w}_l^q(\alpha) = \bar{w}_{l,0}^q + y^q\Big(\eta\int_0^{\alpha}\lambda_l(\tau)\,d\tau\Big). \tag{4.53}$$

Neglecting the initial weight $\bar{w}_{l,0}^q$, the characteristic quantities $z_l$ are therefore
proportional to the average payoffs of option $l$, integrated from the beginning of
the game.

Regarding the second question, what is this good for? Eq. (4.49) states the
following: The bias in the pattern creates a strong preference for those options
that have accumulated higher $z_l$, i.e., higher expected payoffs in the course of the
game. This preference can be overruled by either a fluctuation in the pattern
(if the total pattern $\mathbf{x} = \tilde{\mathbf{x}} + \mathbf{y}$ has a negative overlap with the bias $\mathbf{y}$, the bias
accumulated in the weights works in the opposite direction) or by the contribution
from the weight and pattern perpendicular to the bias. For sufficiently large bias,
either case becomes unlikely, and the learning process again reduces to Fictitious
Play.

As for time-series patterns, the relevance is this: if the strategy played by
the opponent has a significant bias (for example, if the opponent is playing b*,
where roughly half of the options do not appear at all), some of the $\delta_q$ are
significantly different from 0, and behavior similar to Fictitious Play can be
expected - if the opponent's output is properly randomized. If two networks with
equivalent architecture use each other's outputs as input patterns, both lock into
short cycles for many repetitions, never coming close to anything like equilibrium.

4.6 Summary 

The scenarios presented in this chapter have shown that a slightly modified
Winner-Takes-All Perceptron with a learning algorithm inspired by the Confused
Bit Generator can be used to learn a good strategy in zero-sum games: it
concentrates on the optimal pure strategy if there is one, and finds a good mixed
strategy even against adaptive opponents. The ingredients for success are a drift
term which strengthens options whose payoff is better than the current one, and
weakens the others, and a noise term which gradually increases the norms of the
weight vectors, decreasing the impact that a single learning step has.

This learning algorithm can be expected to converge to Nash equilibrium
in the limit of infinitely slow adaptation ($M \to \infty$) and infinite learning times
($\alpha \to \infty$), and gives a reasonable approximation of equilibrium for finite $M$ and
$\alpha$. It is not "rational" - on the other hand, it generates a truly mixed strategy, as
opposed to algorithms like Fictitious Play, which generate mixed strategies only
in the long-time average.

Finding the equilibrium mixed strategy a* in a zero-sum game is a nontrivial
problem of stochastic optimization: the "energy landscape" becomes increasingly
flat as one's opponent comes closer to his equilibrium strategy b*; however, only
then is a* the best response. Considering this difficulty, the success of the
presented learning algorithm is a pleasant surprise, even if learning is slow.

A different learning algorithm is required if there is a bias in the patterns.
This bias leads to a bias in the weight vectors, and the combination of these
causes a bias in the hidden fields of the neural network. This can be exploited: it
is now the overlap of the weight vector with the bias in the patterns, rather than
the norm of the weight vector, which gives the preference for one or the other
option. Depending on the strength of the bias, one gets pure guessing in the limit
of small bias, fictitious play in the limit of infinite bias, and "soft" fictitious play
in intermediate regimes.

So far, no scenario has been found where the neural network could explicitly
make use of any actual information encoded in the patterns. If time series of the
opponent's output are used for patterns, and both players use networks, they lock
into short cycles. More work is necessary to decide whether this is a systematic
problem of the used network architecture, and whether it can be remedied with
a different learning algorithm.

Chapter 5

Conclusion and outlook



The projects presented in this dissertation tried to make some connections
between the fields named in its title: neural networks, game theory, and time series
generation. Neural networks played games, time series were generated by games,
networks learned and generated time series, etc. The concept that connects these
fields is that of prediction: neural networks can be used as prediction algorithms,
time series generation and prediction are so closely related that they are almost
interchangeable, and, last but not least, economic game theory deals with the
attempt to predict an opponent's actions as accurately as possible and not allow
him to predict and outplay oneself.

The aim of Chap. 2 was to study the properties of time series that are an- 
tipredictable for three selected prediction algorithms, and to find common aspects 
between them, such as a tendency towards longer cycles than for perfectly pre- 
dictable series, and a suppression of the features that the algorithm is sensitive 
to. 

Chap. 3 gave an overview of some of the variations of the Minority Game,
some of which were first introduced by our research group. It was demonstrated
that this game naturally gives rise to antipredictability in the time series of
minority decisions if agents try to adapt rapidly. I also demonstrated that, with
minor adaptations, all conventional strategies can be generalized to more than
two options.

In Chap. 4, a neural network tried to find a suitable strategy for a zero-sum
game by learning from experience. A learning algorithm was found that
uses random input patterns to generate a mixed strategy that can come close
to the equilibrium strategy. This algorithm can probably still be optimized by
a more systematic analysis of what information can be extracted from watching
the opponent's play. However, it is not yet obvious under what circumstances a
learning algorithm should play a mixed strategy at all.

By its conception, this work is interdisciplinary, and a reader with a
background in physics may have wondered, "Is this physics?" The answer, depending
on one's point of view, is no, yes, yes, and maybe.

This dissertation does not deal with the classical subjects of physics, namely,
the properties of inanimate matter. In that sense, it is not physics.

A different definition might be, "physics is what physicists do". In that sense,
it is physics: all of the co-workers who collaborated on the various projects were
trained as physicists, as are the majority of scientists who have published on the
statistical properties of neural networks and on the Minority Game.

The third answer concerns the methods: most of the methods used to achieve
results in this work, like averaging over disorder in online learning, the application
of Markov chains in the stochastic MG, the analysis of the nonlinear properties
of the CSG, and the extensive use of Monte Carlo simulations to check analytical
results (or replace them where they are not available), are well rooted in statistical
physics and nonlinear dynamics.

The fourth answer should really be, "who cares, as long as it is interesting?" 
If methods from one field can help to get insights in other fields, they should be 
applied, no matter how the result has to be classified. 

I hope that this work has shed some light on the problems it treated from
unusual angles. What must be kept in mind is that all models presented here
are toy problems that may or may not have a strong tie to real life. Many of
these toy problems develop a life of their own, intriguing researchers with
mathematical subtleties and technical challenges. I have tried to avoid this tendency
by presenting and comparing several algorithms and strategies and finding global
aspects that connect them.

What remains to be done is to establish which of these models has what
relevance to "real life", i.e., the behavior of humans, or animals, or other entities:
what strategies do stock brokers, car drivers, predatory animals use when they
make their decisions in Minority Game-like settings? How do humans adapt their
strategy in games? This will necessarily require much know-how on sociology,
ecology or psychology and much experimental work, such as that begun in Ref.
[100]. It may thus not be a question that a theoretical physicist is well-equipped
to answer.

Appendix A
Notation



In a work that deals with a rather large number of projects, concepts and math- 
ematical objects, a certain overloading of symbols like a, A and N is inevitable. 
This chart might be helpful to find out what meaning a given symbol has in each 
chapter. 



A, B        players in a zero-sum game
a, b        strategy vectors of players A and B
a*, b*      optimal (Nash-equilibrium) strategies
C           Chap. 2: autocorrelation function;
            Chap. 4: norm of center-of-mass of perceptrons.
C           Chap. 3: correlation matrix for multi-choice perceptron;
            Chap. 4: payoff matrix.
h           hidden fields of neural networks.
H           Chap. 2: energy in the Bernasconi Model;
            Chap. 3: information generated in the standard MG.
K           Sec. 3.6: state of the stochastic MG;
K, L        Chap. 4: number of options available to A and B;
l           Chap. 2: cycle length.
M           memory length of prediction algorithm.
M           Sec. 2.3.5: Jacobi matrix of the CSG;
            Sec. 2.4.4: transition probabilities on DeBruijn graph.
N           Chap. 2: number of nodes in a graph;
            Chap. 3: number of players in the MG.
p           Sec. 3.6: probability of changing sides.
Pap         antipersistence parameter (see Sec. 2.4.5)
Ph, Ph      Sec. 2.4.2: probability of hitting a cycle.
Q           number of options in multi-choice MG;
            number of different inputs/outputs in multi-choice perceptrons
r           Sec. 3.4: relative vectors.
R           Sec. 3.4, 3.7.2: overlap of weight vectors;
            Sec. 3.7.4: self-overlap of strategy vector;
            Chap. 4: normalized overlap between a and a*.
s           Sec. 2.4.5: success rate of prediction algorithm;
S           Sec. 2.3.2: r.m.s. output of the CSG.
            Chap. 3: decision of the minority.
w^t         parameters of a prediction algorithm at time t.
            specifically: weight vector of a perceptron.
W, W        Sec. 3.6: transition matrix/kernel.
x           Sec. 3.6: proportionality constant p = 2x/N.
x^t         value of a time series.
x^t         vector with components x^~^~^, . . . x^'^.
α           neural networks: rescaled learning time: α = t/M.
            Bernasconi problem: ratio p/M between length of sequence and memory;
            standard MG: ratio p/N between length of decision table
            and number of players.
β           amplification of continuous perceptron.
γ           rescaled amplification βη of continuous perceptron.
η           learning rate.
λ           Chap. 2: Lyapunov exponents of the CSG;
            Chap. 4: expected payoff.
π, π        Sec. 3.6: state vector/function.